Add link archive system: snapshots, backlinks, link-rot

Preserve external works the site cites against link rot, host them at
permanent /archive/<slug>/ URLs in site chrome, and treat them as
first-class citizens of the backlinks and similar-pages indexes.
Curated, not crawled: the author adds one line to archive/manifest.yaml
and the build fetches, hashes, snapshots, and indexes the work.

* archive/manifest.yaml + tools/archive.py (fetch / refresh / wayback /
  check / gc) — PDFs downloaded directly, HTML pages snapshotted with a
  vendored monolith (tools/bin/monolith @ 2.10.1) into a single
  self-contained file with the archive CSP and a noarchive robots meta
  injected. Per-entry PROVENANCE.json committed; gitignored .txt
  sidecars regenerated from the artifact's SHA-256.
* build/Archive.hs + build/ArchiveIndex.hs + build/Filters/Archive.hs
  — Hakyll rules for /archive/ and /archive/<slug>/, a body Pandoc
  filter that appends an archive affordance to live citations and
  flips dead ones to the local copy, driven by archive.py check's
  asymmetric hysteresis (rotted needs 3 consecutive failures over
  >= 14 days; one success recovers).
* build/Backlinks.hs — keeps archived external URLs through pass 1 and
  canonicalises them to /archive/<slug>/ in pass 2, producing a
  "Referenced by" section grouped by the fragment each citation
  targets. build/Stats.hs gains a "Link archive" telemetry block on
  /build/ (count, total size, median age, by-status / by-quality /
  by-visibility, orphans).
* Integrity: archive.py fetch and build/Archive.hs (via sha256sum)
  both re-hash every committed artifact, so a tampered file halts the
  build even with cabal invoked directly or no .venv present. refresh
  refuses to replace an uncommitted prior snapshot and rolls back
  atomically on any exit path. removed.yaml is honoured by fetch,
  wayback, and check using canonical-form (tracking-stripped,
  arXiv-canonicalised) comparison.
* visibility: private keeps an entry in-repo but undeployed.
  nginx/archive.conf emits X-Robots-Tag: noindex, noarchive for raw
  artifacts that cannot carry meta directives.
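
The hysteresis rule in the Filters/Archive.hs item can be sketched as
follows. This is a minimal illustration, not the code in
tools/archive.py: the LinkState shape, its field names, and the
record_probe helper are hypothetical, and the sketch assumes the
14-day span is measured from the first failure of the current streak.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

FAILS_NEEDED = 3               # "3 consecutive failures" per the commit text
MIN_SPAN = timedelta(days=14)  # ">= 14 days" per the commit text


@dataclass
class LinkState:
    """Hypothetical per-URL record; the real archive-state.json may differ."""
    status: str = "ok"                 # "ok" or "rotted"
    fail_count: int = 0                # length of the current failure streak
    first_fail: Optional[date] = None  # date the current streak began


def record_probe(state: LinkState, ok: bool, today: date) -> LinkState:
    """Fold one probe result into the state, asymmetrically."""
    if ok:
        # A single success clears the streak and recovers immediately.
        return LinkState()
    first = state.first_fail or today
    count = state.fail_count + 1
    # Rot requires both enough failures and enough elapsed time.
    rotted = count >= FAILS_NEEDED and (today - first) >= MIN_SPAN
    return LinkState("rotted" if rotted else "ok", count, first)
```

The asymmetry is deliberate in this sketch: a brief outage cannot flip
a citation to the local copy, while a link that comes back is trusted
again at once.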

The full design, phase plan (1-5), and three refinement passes live
in ARCHIVE.md.
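
The canonical-form comparison used for removed.yaml in the Integrity
item might look roughly like the sketch below. Everything specific
here is an assumption for illustration: the list of tracking
parameters, the arXiv path pattern, and the canonical() helper name
are not taken from tools/archive.py, and ARCHIVE.md remains the
authority on the actual rules.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; the real list may be longer or different.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid"}

# Assumed arXiv shapes: /abs/ID, /pdf/ID, optional vN suffix and .pdf.
ARXIV_PATH = re.compile(
    r"^/(?:abs|pdf)/(?P<id>\d{4}\.\d{4,5})(?:v\d+)?(?:\.pdf)?/?$"
)


def canonical(url: str) -> str:
    """Reduce a URL to a form suitable for equality comparison."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host in ("arxiv.org", "www.arxiv.org"):
        match = ARXIV_PATH.match(parts.path)
        if match:
            # Collapse abs/pdf/versioned variants onto one abstract URL.
            return f"https://arxiv.org/abs/{match.group('id')}"
    # Drop tracking parameters, keep everything else in order.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_KEYS
    ]
    # Lowercase scheme and host; discard the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path,
         urlencode(query), "")
    )
```

Two URLs would then be treated as the same work when
canonical(a) == canonical(b), so a removed entry cannot be re-added
under a cosmetically different URL.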

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Levi Neuwirth 2026-05-23 10:06:33 -04:00
parent 14c881b9e4
commit 77e31efdae
31 changed files with 5127 additions and 40 deletions

10
.gitignore vendored

@@ -69,10 +69,20 @@ data/similar-links.json
data/backlinks.json
data/build-stats.json
data/build-start.txt
data/build-stamp.txt
data/last-build-seconds.txt
data/semantic-index.bin
data/semantic-meta.json
# Archive: generated text + its staleness stamp (recreated from the
# committed artifact on every build — deterministic, so committing them is
# churn). archive/**/PROVENANCE.json is deliberately NOT ignored — it is
# the committed, immutable record of each archival event.
archive/**/*.txt
archive/**/*.txt.sha256
data/archive-index.json
data/archive-state.json
# IGNORE.txt is for the local build and need not be synced.
IGNORE.txt

1535
ARCHIVE.md Normal file

File diff suppressed because it is too large


@@ -1,4 +1,4 @@
.PHONY: build deploy sign download-model download-pdfjs download-leaflet compress-assets convert-images pdf-thumbs pdfs watch clean dev
.PHONY: build deploy sign download-model download-pdfjs download-leaflet compress-assets convert-images pdf-thumbs pdfs watch clean dev archive-gc archive-wayback archive-check
# Source .env for deploy / GitHub config if it exists.
# .env format: KEY=value (one per line, no `export` prefix, no quotes needed).
@@ -43,6 +43,16 @@ build:
else \
echo "Photography sidecars skipped: run 'uv sync' to enable EXIF + palette + dimension extraction (build continues with frontmatter only)"; \
fi
# Archive pipeline (Phase 1): fetch any manifest URL without a local
# artifact, extract text, write archive/<slug>/PROVENANCE.json and
# data/archive-index.json. Gated on .venv, same as embed.py. A SHA or
# slug-URL integrity error exits non-zero and halts the build; a
# transient network failure is non-fatal (the entry retries next build).
@if [ -d .venv ]; then \
uv run python tools/archive.py fetch; \
else \
echo "Archive fetch skipped: run 'uv sync' to enable link archiving (build continues)"; \
fi
cabal run site -- build
pagefind --site _site
@if [ -d .venv ]; then \
@@ -153,6 +163,38 @@ watch:
clean:
cabal run site -- clean
# Evict archived works: delete archive/<slug>/ directories whose slug is
# recorded in archive/removed.yaml. Opt-in — NEVER run by `make build`.
# Orphan directories (not in manifest.yaml, not in removed.yaml) are
# reported, never deleted. See ARCHIVE.md - Eviction & removal.
archive-gc:
@if [ -d .venv ]; then \
uv run python tools/archive.py gc; \
else \
python3 tools/archive.py gc; \
fi
# Submit archived URLs to the Wayback Machine and backfill the capture URL
# into each PROVENANCE.json. A slow network job — opt-in, never run by
# `make build`. Always exits 0; an entry without a capture retries next run.
archive-wayback:
@if [ -d .venv ]; then \
uv run python tools/archive.py wayback; \
else \
python3 tools/archive.py wayback; \
fi
# Probe every archived URL for link rot, updating data/archive-state.json.
# A slow network job — opt-in, never run by `make build`. Asymmetric
# hysteresis: `rotted` needs 3 consecutive failures over >=14 days; a
# single success recovers immediately. The next build consumes the state.
archive-check:
@if [ -d .venv ]; then \
uv run python tools/archive.py check; \
else \
python3 tools/archive.py check; \
fi
# Dev build includes any in-progress drafts under content/drafts/essays/.
# SITE_ENV=dev is read by build/Site.hs; drafts are otherwise invisible to
# every build (make build / make deploy / cabal run site -- build directly).


@@ -0,0 +1,14 @@
{
"url": "https://cr.yp.to/aes-speed.html",
"slug": "djb-aes-speed",
"title": "Cache-timing attacks on AES (cr.yp.to)",
"type": "html",
"artifact": "snapshot.html",
"sha256": "8da2d5aedeccf9f602e1680631aa77308683803c0cc9b04caad52c7a70c60832",
"previous-sha256": "0a50bf6d64b2ec08771d83be5ef47721ecbfc431e3512ff55978e76f452dbd3f",
"bytes": 26186,
"archived": "2026-05-23",
"source-date": null,
"snapshot-quality": "ok",
"wayback": null
}
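
The record above pins its artifact by SHA-256 and byte count. A
minimal sketch of a re-hash check against such a record, assuming the
artifact sits next to PROVENANCE.json in the same archive/<slug>/
directory; verify_entry is a hypothetical helper and is not the check
implemented in tools/archive.py or build/Archive.hs.

```python
import hashlib
import json
from pathlib import Path


def verify_entry(entry_dir: Path) -> str:
    """Re-hash one entry's artifact and compare it to PROVENANCE.json.

    Returns the hex digest on success; exits non-zero on a mismatch,
    mirroring the "integrity error halts the build" behaviour.
    """
    record = json.loads(
        (entry_dir / "PROVENANCE.json").read_text(encoding="utf-8")
    )
    data = (entry_dir / record["artifact"]).read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    if len(data) != record["bytes"] or digest != record["sha256"]:
        raise SystemExit(
            f"integrity error: {record['slug']}: "
            f"expected {record['sha256']} ({record['bytes']} bytes), "
            f"got {digest} ({len(data)} bytes)"
        )
    return digest
```

Checking the byte count as well as the hash is redundant for security
but gives a clearer message when a file was truncated rather than
altered.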


@@ -0,0 +1,470 @@
<!-- Saved from https://cr.yp.to/aes-speed.html at 2026-05-23T13:04:33Z using monolith v2.10.1 -->
<html><head><meta content="default-src 'none'; img-src data:; style-src 'unsafe-inline'; style-src-elem 'unsafe-inline'; style-src-attr 'unsafe-inline'; font-src data:; script-src 'none'; object-src 'none'; frame-src 'none'" http-equiv="Content-Security-Policy"/><meta content="noindex, noarchive" name="robots"/><link href="data:text/html;base64,PGh0bWw+PGJvZHk+ZmlsZSBkb2VzIG5vdCBleGlzdDwvYm9keT48L2h0bWw+DQo=" rel="icon"/></head><body>
<title>AES speed</title>
<meta content="aes" name="keywords"/>
<a href="https://cr.yp.to/djb.html">D. J. Bernstein</a>
<br/><a href="https://cr.yp.to/hash.html">Hash functions and ciphers</a>
<h1>AES speed</h1>
<b>Update:</b>
Peter Schwabe and I now have a paper on this topic:
<ul>
<li>
<a name="aesspeed-paper">[aesspeed]</a>
15pp.
<a href="https://cr.yp.to/aes-speed/aesspeed-20080926.pdf">(PDF)</a>
D. J. Bernstein, Peter Schwabe.
New AES software speed records.
Document ID: b90c51d2f7eef86b78068511135a231f.
URL: https://cr.yp.to/papers.html#aesspeed.
Date: 2008.09.26.
Supersedes:
<a href="https://cr.yp.to/aes-speed/aesspeed-20080908.pdf">(PDF)</a>
2008.09.08.
</li></ul>
The software is now available as part of the
<a href="https://cr.yp.to/streamciphers/timings.html#toolkit-estreambench">estreambench</a>
toolkit.
We have placed the software into the public domain;
feel free to integrate it into your own AES applications!
<p>
Information below this line has not yet been updated.
</p><hr/>
This document describes various speedups in AES software.
This document assumes that
the software is going to be used in an application
where timing information is <i>not</i> exposed to attackers.
<p>
The reader is expected to already know the standard structure of AES software:
</p><ul>
<li>each of the 16 state bytes is used as an index for a table lookup producing a 32-bit word;
</li><li>16 xors combine these 16 words and 4 expanded key words into 4 new state words;
</li><li>those 4 words are viewed as the starting 16 bytes for the next round.
</li></ul>
See Section 5.2.1 of "AES Proposal: Rijndael" by Daemen and Rijmen.
<h2>Endianness</h2>
On a little-endian CPU,
extracting the first byte of a 32-bit word
is an &amp;0xff arithmetic instruction;
on a big-endian CPU,
extracting the first byte of a 32-bit word
is a &gt;&gt;24 arithmetic instruction.
Similar comments apply to the other bytes.
<p>
One can write AES software
that uses arithmetic instructions as if the CPU were little-endian.
If the CPU is actually big-endian,
the software swaps the bytes of the AES key, input, and output (at run time).
The software also swaps the bytes of the table (at compile time),
for example by expressing the table as a sequence of 32-bit integers.
</p><p>
<b>Matched endianness.</b>
One can easily eliminate the byte-swapping time for the AES key, input, and output:
simply use the appropriate arithmetic instructions
for the endianness of the CPU.
In this case the table must not be swapped.
</p><h2>Table structure</h2>
All else being equal, smaller AES tables are faster:
they take less time to load into cache and are more likely to stay in cache.
Beware that most benchmarking tools preload caches and thus can't see this speedup.
<p>
Daemen and Rijmen suggest "4 KBytes of tables."
There are 4 tables.
Each table has 256 words occupying 1024 bytes.
The loads are spread evenly across the tables.
</p><p>
<b>Rotated lookups.</b>
Daemen and Rijmen suggest an alternative "with a total table size of 1KByte"
but with extra arithmetic.
The point is that the tables are rotations of each other:
for example,
the first word of the first table is (0xc6,0x63,0x63,0xa5),
the first word of the second table is (0xa5,0xc6,0x63,0x63),
the first word of the third table is (0x63,0xa5,0xc6,0x63),
and the first word of the fourth table is (0x63,0x63,0xa5,0xc6).
One can store the first table,
and simulate a lookup in another table at the cost of an extra rotation.
</p><p>
<b>Unaligned loads.</b>
One can instead use a single 2KB table having 256 8-byte entries
such as (0x00,0x63,0xa5,0xc6,0x63,0x63,0xa5,0xc6).
There are many reasonable choices of pattern here;
what's important is that the pattern includes the desired
(0xc6,0x63,0x63,0xa5) and (0xa5,0xc6,0x63,0x63) and so on as substrings.
On the Pentium, the PowerPC, et al.,
one can load 4-byte words from memory addresses that aren't divisible by 4,
and there's no penalty when the word doesn't cross an 8-byte boundary.
</p><h2>Masked loads</h2>
16 of the 160 table lookups in 10-round AES are masked.
The 40 table lookups in 10-round AES key expansion are also masked.
The masks are 0x000000ff, 0x0000ff00, 0x00ff0000, and 0xff000000, each used equally often.
<p>
The simplest way to compute a mask is with an arithmetic instruction: for example, &amp;0xff00.
</p><p>
<b>Byte loads.</b>
One can eliminate 25% of the masks,
namely the bottom-byte masks,
by combining them with load instructions.
All popular CPUs have single-byte-load instructions.
</p><p>
<b>Two-byte loads.</b>
One can eliminate another 25% of the masks
on CPUs with two-byte-load instructions.
This constrains the table pattern:
it's important to have (0x00,0x63) on little-endian CPUs,
and (0x63,0x00) on big-endian CPUs.
</p><p>
<b>Masked tables.</b>
One can eliminate all of the masks by precomputing masked tables, using extra table space.
The simplest table structure uses a total of 8KB.
Two tables, one with entries such as (0x00,0x63,0xa5,0xc6,0x63,0x63,0xa5,0xc6)
and another with entries such as (0x00,0x00,0x00,0x00,0x63,0x00,0x00,0x00),
use a total of 4KB.
In my experience,
the cost of larger tables outweighs the benefit of eliminating a few masks.
</p><h2>Key expansion</h2>
A 4-word (128-bit) key is expanded in 40 steps.
Each step produces a new word, totalling 44 words in the expanded key.
A step has a byte extraction (see below), a masked load, and two xors.
The total work is 40 byte extractions, 40 masked loads, and 80 xors.
For comparison, the subsequent work to encrypt a block involves
160 byte extractions, 160 loads (of which 16 are masked), and 160 xors.
<p>
Daemen and Rijmen say (Section 4.3.2)
that key expansion involves "almost no computational overhead."
Obviously key expansion is less expensive than encrypting a block.
On the other hand, the cost of key expansion is still quite noticeable.
</p><p>
<b>Expanded keys.</b>
A typical AES implementation precomputes and stores an expanded key.
The 40 byte extractions, 40 masked loads, and 80 xors aren't repeated for every block;
they are done only once, along with 44 stores.
Each block then involves 44 extra loads for the expanded key.
Some stores and loads can be eliminated
if many blocks are handled at once
and some extra registers are available.
</p><p>
Long-term storage of an expanded key can slow down applications that handle many keys:
the expanded keys take more time to load into cache
than the original keys and are less likely to stay in cache.
</p><p>
<b>Partially expanded keys.</b>
An alternative is to precompute and store a partially expanded key,
only 14 words instead of 44 words.
The partially expanded key consists of words
0, 1, 2, 3, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40 from the expanded key.
Loading the partially expanded key, and converting it into the fully expanded key,
takes only 14 loads and 30 xors.
</p><p>
One can interpolate between partial expansion and full expansion,
using various amounts of storage per key and achieving various balances between load and xor.
</p><h2>Index extraction</h2>
The 16 xor operations in an AES round
produce 4 words in 4 integer registers.
The 16 bytes of these words are then extracted and used as indices for the next round.
<p>
The simplest way to extract 4 bytes is using 6 instructions,
namely 3 shifts and 3 bottom-byte extractions:
&amp;255;
(&gt;&gt;8)&amp;255;
(&gt;&gt;16)&amp;255;
&gt;&gt;24.
</p><p>
Using a byte as an index then requires multiplying the byte by a constant
that depends on the table structure.
Let's assume the 2KB tables described above; then the constant is 8.
The multiplications use 4 shifts:
&lt;&lt;3;
&lt;&lt;3;
&lt;&lt;3;
&lt;&lt;3.
</p><p>
<b>Scaled-index loads.</b>
Many CPUs can multiply an index register by 8 for free as part of a load.
</p><p>
<b>Scaled-index extractions.</b>
What about CPUs that can't multiply an index register by 8 for free?
Two of the multiplications can nevertheless be eliminated,
because they can be combined with shifts.
The overall extract-and-scale sequence has 8 instructions:
(&lt;&lt;3)&amp;2040;
(&gt;&gt;5)&amp;2040;
(&gt;&gt;13)&amp;2040;
(&gt;&gt;21)&amp;2040.
The PowerPC has a combined rotate-and-mask instruction,
making this sequence take only 4 instructions.
</p><p>
<b>Scaled tables.</b>
One can rotate table entries by 3 bits,
reducing the above 8 instructions to 7 instructions.
</p><p>
<b>Second-byte instructions.</b>
The x86 architecture (Pentium, Athlon, etc.)
includes a combined (&gt;&gt;8)&amp;255 instruction.
This means that extracting 4 bytes takes only 5 instructions:
&amp;255;
(&gt;&gt;8)&amp;255;
&gt;&gt;16;
&amp;255;
&gt;&gt;8.
Alternate 5-instruction sequence:
&amp;255;
(&gt;&gt;8)&amp;255;
&gt;&gt;16;
&amp;255;
(&gt;&gt;8)&amp;255.
</p><p>
Of course, the ultimate measure of performance is a cycle count, not an instruction count.
Matsui states that the (&gt;&gt;8)&amp;255; instruction is "a bit expensive"
on the Pentium 4 Prescott (f33, f34, f41);
presumably this means that the instruction takes more cycles than, e.g., a mere &amp;255.
But all of the measurements I've seen indicate the opposite.
I'm not sure what I'm missing here.
</p><p>
<b>32-bit shifts on 64-bit architectures.</b>
The amd64 architecture (P4E, Athlon 64, Core 2, etc.) can right-shift a 64-bit register,
but Matsui comments that this operation is extremely slow on the P4E.
It's much better to use the amd64's x86-compatible right-shift instruction;
this instruction sets the top 32 bits of its 64-bit input to 0 before shifting.
</p><p>
<b>Byte extraction via loads.</b>
A completely different way to extract 4 bytes is with 1 store and 4 loads.
One can mix this with the previous approaches
to achieve various balances between load and arithmetic.
</p><p>
Consider, for example, the UltraSPARC,
which has 2 integer units and 1 load/store unit.
A traditional sequence of
14 partially-expanded-key loads (see below), 30 key-expansion xors,
160 scaled-index extractions, 160 table-lookup loads, 160 xors, 16 masks,
4 input loads, and 4 output stores
occupies a total of 526 integer instructions (at least 263 cycles)
and 182 loads (at least 182 cycles).
Using loads for some byte extractions,
replacing 36 scaled-index extractions with 9 stores and 36 loads,
means a total of 454 integer instructions (at least 227 cycles)
and 227 loads/stores (at least 227 cycles).
</p><h2>Unrolling</h2>
A typical 9-iteration AES loop
involves 9 increments of a loop index, 9 comparisons, and 9 branches,
one of which is mispredicted on most CPUs.
The loop index also consumes a register,
forcing an extra 9 stores and 9 loads on CPUs that don't have registers to spare.
<p>
<b>Full unrolling.</b>
One can eliminate all of these costs by fully unrolling the loop.
Beware, however, that full unrolling costs a few kilobytes of code-cache space.
</p><p>
<b>Partial unrolling.</b>
CPUs are more likely to correctly predict a 4-iteration loop than a 9-iteration loop.
</p><h2>Instruction scheduling</h2>
The 16 table lookups in an AES round are independent
and can be scheduled in many different ways.
One can, for example,
perform all the table lookups for the first input from bottom byte to top
(outputs 0, 3, 2, 1),
then perform all the table lookups for the second input from bottom byte to top
(outputs 1, 0, 3, 2),
then perform all the table lookups for the third input from bottom byte to top
(outputs 2, 1, 0, 3),
then perform all the table lookups for the fourth input from bottom byte to top
(outputs 3, 2, 1, 0).
One can, as another example,
first perform all the table lookups for the first output in order of the inputs,
then perform all the table lookups for the second output in order of the inputs,
etc.
<p>
<b>Maximum parallelism.</b>
The overall depth of the AES round is
one byte extraction plus one table lookup plus two xors:
a mythical CPU offering extensive parallelism
could perform all sixteen byte extractions in parallel,
then all sixteen table lookups in parallel,
then eight xors in parallel,
then four xors in parallel.
Note that each output is obtained by xor'ing two parallel xor's,
rather than by three serial xor's.
</p><p>
<b>Deferring loads.</b>
The amd64 architecture poses several challenges to AES instruction scheduling.
First,
most integer instructions require the output register to be one of the input registers.
Second,
typical amd64 CPUs handle a load and xor most efficiently as a unified load-xor,
but a unified load-xor gives no opportunity to switch registers.
Third,
only 4 registers (eax, ebx, ecx, edx) allow second-byte instructions.
</p><p>
Matsui concludes that, on amd64 (and x86),
keeping each round's inputs y0, y1, y2, y3 and outputs z0, z1, z2, z3 in eax, ebx, ecx, edx,
to allow second-byte instructions,
is "impossible without saving/restoring."
But that's incorrect.
No extra copies are required.
A careful instruction sequence
uses the minimal conceivable number of instructions:
20 for byte extraction,
16 for table lookups,
and 4 for handling the expanded key.
The idea is to extract all the bytes from an input,
freeing the input's register for an output,
before doing any table lookups involving that output:
</p><ul>
<li>Extract the 4 bytes from y0.
At this point y1, y2, y3, and the 4 bytes are live.
</li><li>Feed 1 byte into z0.
At this point y1, y2, y3, z0, and 3 more bytes are live.
</li><li>Extract the 4 bytes from y1, immediately feeding 1 into z0.
At this point y2, y3, z0, and 6 more bytes are live.
</li><li>Feed 2 bytes into z1.
At this point y2, y3, z0, z1, and 4 more bytes are live.
</li><li>Extract the 4 bytes from y2, immediately feeding 2 into z0 and z1.
At this point y3, z0, z1, and 6 more bytes are live.
</li><li>Feed 3 bytes into z2.
At this point y3, z0, z1, z2, and 3 more bytes are live.
</li><li>Extract the 4 bytes from y3, immediately feeding 3 into z0, z1, and z2.
At this point z0, z1, z2, and 4 more bytes are live.
</li><li>Feed 4 bytes into z3.
At this point z0, z1, z2, and z3 are live.
</li><li>Handle 4 words of the expanded key.
</li></ul>
The maximum number of live registers here is 9,
fitting easily into the amd64 instruction set.
<p>
<b>Squeezing inputs and outputs into 7 32-bit registers.</b>
The x86 architecture poses an additional challenge to AES instruction scheduling:
there are only 7 general-purpose integer registers.
</p><p>
It's still possible to handle a round with 0 stores, 4 expanded-key loads,
and 16 loads for table lookups.
The shortest instruction sequence that I know has a total of 46 instructions,
6 more than what would be possible with extra registers;
1 of the 46 instructions can be eliminated if the key expansion is changed.
</p><p>
The idea of this instruction sequence
is to rotate y0 by 16 bits,
use the bottom two bytes of both y0 and y2,
and then merge the remaining four bytes of y0 and y2 into a single register
(for example, shifting y0 down 16 bits, masking y1, and adding the results),
freeing a register at the cost of 3 extra instructions (the rotate, the mask, and the add);
splitting 3 load-xor instructions into 3 loads and 3 xors
then easily puts all outputs into suitable registers.
The rotation can be eliminated if the expanded-key word that corresponds to y0
is rotated by 16 bits.
</p><h2>Speed reports</h2>
Speed reports vary in whether they use CTR, CBC, etc.,
and in the exact rules for measuring speeds.
The "eSTREAM" cycles/byte counts are
for counter-mode AES measured by the eSTREAM benchmarking toolkit;
future implementors are encouraged to support the eSTREAM interface for direct comparability.
<table border="">
<tbody><tr><th>Architecture</th><th>CPU</th><th>eSTREAM cycles/byte</th><th>Ad-hoc cycles/byte</th><th>Software</th></tr>
<tr><td>amd64</td><td>Intel Core 2 Duo (6f6)?</td><td></td><td>9.2</td><td>Matsui/Nakajima (CHES 2007)</td></tr>
<tr><td>amd64</td><td>AMD Athlon 64 (15,75,2)?</td><td></td><td>10.625 (170/block)</td><td>Matsui (FSE 2006)</td></tr>
<tr><td>amd64</td><td>AMD Athlon 64 (15,75,2)?</td><td></td><td>12.4375 (199/block)</td><td>Lipmaa</td></tr>
<tr><td>amd64</td><td>Intel Core 2 Duo (6f6); katana</td><td>12.56</td><td></td><td>hongjun/v1/1</td></tr>
<tr><td>amd64</td><td>Intel Core 2 Quad Q6600 (6fb); latour</td><td>12.57</td><td></td><td>hongjun/v1/1</td></tr>
<tr><td>amd64</td><td>AMD Athlon 64 (15,75,2)?</td><td></td><td>13.125 (210/block)</td><td>Osvik</td></tr>
<tr><td>amd64</td><td>AMD Athlon 64 X2 (15,75,2); mace</td><td>13.32</td><td></td><td>hongjun/v1/1</td></tr>
<tr><td>amd64</td><td>AMD Opteron 240 (f58); nmisles8amd64</td><td>13.45</td><td></td><td>bernstein/amd64-1/1</td></tr>
<tr><td>x86</td><td>Intel Pentium III (68a)?</td><td></td><td>14 (224/block)</td><td>Osvik</td></tr>
<tr><td>x86</td><td>AMD Athlon (622)?</td><td></td><td>14.0625 (225/block)</td><td>Osvik</td></tr>
<tr><td>x86</td><td>Intel Pentium III (68a)?</td><td></td><td>14.125 (226/block)</td><td>Lipmaa</td></tr>
<tr><td>x86</td><td>Intel Pentium 4 (f12)?</td><td></td><td>15 (240/block)</td><td>Osvik</td></tr>
<tr><td>x86</td><td>Intel Pentium 4 (f12)?</td><td></td><td>15.875 (254/block)</td><td>Lipmaa</td></tr>
<tr><td>x86</td><td>Intel Pentium M (695); whisper</td><td>15.96</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>amd64</td><td>Intel Pentium 4 (f64)?</td><td></td><td>16 (256/block)</td><td>Matsui (FSE 2006)</td></tr>
<tr><td>x86</td><td>Intel Pentium III (68a)?</td><td></td><td>16.25 (260/block)</td><td>Gladman</td></tr>
<tr><td>amd64</td><td>Intel Pentium D (f64); nmi0161</td><td>16.74</td><td></td><td>bernstein/amd64-2/1</td></tr>
<tr><td>amd64</td><td>Intel Pentium D (f64); svlin001</td><td>16.75</td><td></td><td>bernstein/amd64-2/1</td></tr>
<tr><td>amd64</td><td>Intel Xeon (f41); nmi0056</td><td>16.75</td><td></td><td>bernstein/amd64-2/1</td></tr>
<tr><td>amd64</td><td>Intel Xeon (f4a); nmi0090</td><td>16.77</td><td></td><td>bernstein/amd64-2/1</td></tr>
<tr><td>sparc</td><td>Sun UltraSPARC III</td><td></td><td>16.875 (270/block)</td><td>Lipmaa</td></tr>
<tr><td>amd64</td><td>Intel Xeon (f41); nmi0057</td><td>16.89</td><td></td><td>bernstein/amd64-2/1</td></tr>
<tr><td>amd64</td><td>Intel Pentium D (f64); speed</td><td>16.90</td><td></td><td>bernstein/amd64-2/1</td></tr>
<tr><td>amd64</td><td>Intel Pentium D (f64); nmi0104</td><td>16.90</td><td></td><td>bernstein/amd64-2/1</td></tr>
<tr><td>amd64</td><td>Intel Pentium D (f64); nmi0241</td><td>16.93</td><td></td><td>bernstein/amd64-2/1</td></tr>
<tr><td>ppc64</td><td>IBM POWER5; nmi0154</td><td>16.93</td><td></td><td>bernstein/big-1/1</td></tr>
<tr><td>x86</td><td>Intel Pentium 4 (f24); nmi0086</td><td>16.96</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>Intel Pentium 4 (f12); fireball</td><td>16.98</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>Intel Pentium 4 (f24); nmitest4</td><td>17.01</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>ppc64</td><td>IBM PowerPC G5 970; nmi0048</td><td>17.17</td><td></td><td>bernstein/big-1/1</td></tr>
<tr><td>x86</td><td>Intel Pentium 2 (652); boris</td><td>17.33</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>Intel Pentium 3 (68a)</td><td>17.49</td><td></td><td>Bernstein aes-128/x86-mmx-1</td></tr>
<tr><td>x86</td><td>Intel Pentium 3 (672); orpheus</td><td>17.55</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>Intel Pentium M (6d8)</td><td>17.57</td><td></td><td>Wu v0/1</td></tr>
<tr><td>x86</td><td>Intel Pentium 4 (f33)?</td><td></td><td>17.75 (284/block)</td><td>Matsui/Fukuda (FSE 2005)</td></tr>
<tr><td>x86</td><td>Intel Xeon (f29); nmibuild40</td><td>17.79</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>Intel Xeon (f27); nmi0059</td><td>17.79</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>Intel Xeon (f25); nmibuild16</td><td>17.79</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>Intel Xeon (f25); nmi0013</td><td>17.79</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>Intel Xeon (f29); nmi0059</td><td>17.80</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>Intel Xeon (f29); nmibuild17</td><td>17.81</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>Intel Xeon (f25); nmibuild15</td><td>17.82</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>Intel Xeon (f25); nmibuild26</td><td>17.83</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>Intel Xeon (f25); nmibuild21</td><td>17.83</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>Intel Xeon (f25); nmi0036</td><td>17.84</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>Intel Xeon (f25); nmibuild22</td><td>17.84</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>AMD Athlon (622); thoth</td><td>18.38</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>ppc32</td><td>IBM POWER4; nmibuild14</td><td>18.55</td><td></td><td>bernstein/little-1/1</td></tr>
<tr><td>x86</td><td>Intel Xeon (f41); nmi0079</td><td>18.88</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>Intel Xeon (f41); nmi0062</td><td>18.89</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>amd64</td><td>Intel Core 2 Duo (6f6)</td><td></td><td>18.9</td><td>OpenSSL 0.9.8e</td></tr>
<tr><td>x86</td><td>Intel Xeon (f41); nmi0061</td><td>18.91</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>Intel Pentium 4 (f41); svlin002</td><td>18.94</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>Intel Xeon (f41); nmi0076</td><td>18.96</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>Intel Xeon (f4a); nmi0102</td><td>18.97</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>Intel Xeon (f41); nmi0060</td><td>18.97</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>Intel Xeon (f41); nmi0063</td><td>18.95</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>Intel Pentium 3 (68a)</td><td>19.06</td><td></td><td>Wu v1/1</td></tr>
<tr><td>ppc32</td><td>Motorola PowerPC G4 7410; gggg</td><td>19.11</td><td></td><td>bernstein/big-1/1</td></tr>
<tr><td>amd64</td><td>Intel Core 2 Duo (6f6)</td><td></td><td>19.5</td><td>OpenSSL 0.9.8a</td></tr>
<tr><td>x86</td><td>AMD Athlon (622)?</td><td></td><td>19.9375 (319/block)</td><td>Lipmaa</td></tr>
<tr><td>x86</td><td>Intel Pentium 1 (52c)</td><td></td><td>20 (320/block)</td><td>Lipmaa</td></tr>
<tr><td>sparc</td><td>Sun UltraSPARC III</td><td>20.75</td><td></td><td>Bernstein big-1/1</td></tr>
<tr><td>amd64</td><td>AMD Athlon 64 (15,75,2)</td><td></td><td>20.9</td><td>OpenSSL 0.9.8e</td></tr>
<tr><td>ppc32</td><td>Motorola PowerPC G4 7400; nmi0042</td><td>20.92</td><td></td><td>bernstein/big-1/1</td></tr>
<tr><td>x86</td><td>Intel Pentium M (6d8)</td><td></td><td>21</td><td>OpenSSL 0.9.8a</td></tr>
<tr><td>x86</td><td>Intel Pentium D (f47); shell</td><td>21.58</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
<tr><td>x86</td><td>AMD Athlon (622)</td><td></td><td>22</td><td>OpenSSL 0.9.8a</td></tr>
<tr><td>x86</td><td>Intel Pentium 4 (f29)</td><td></td><td>22</td><td>OpenSSL 0.9.8b</td></tr>
<tr><td>amd64</td><td>AMD Athlon 64 (15,75,2)?</td><td></td><td>23.5</td><td>OpenSSL 0.9.7e</td></tr>
<tr><td>x86</td><td>Intel Pentium 4 (f41)</td><td></td><td>23.5</td><td>OpenSSL 0.9.8a</td></tr>
<tr><td>x86</td><td>Intel Pentium 3 (672); orpheus</td><td></td><td>23.62</td><td>OpenSSL 0.9.8e</td></tr>
<tr><td>ppc32</td><td>Motorola PowerPC G4 7410</td><td></td><td>24.0625 (385/block)</td><td>Ahrens</td></tr>
<tr><td>x86</td><td>Intel Pentium 4 (f12)</td><td></td><td>24.4</td><td>OpenSSL 0.9.8a</td></tr>
<tr><td>sparc</td><td>Sun UltraSPARC III</td><td></td><td>25</td><td>OpenSSL</td></tr>
<tr><td>ppc32</td><td>Motorola PowerPC G4 7410</td><td></td><td>25.0625 (401/block)</td><td>Ahrens</td></tr>
<tr><td>x86</td><td>Intel Core Duo; nmi0068</td><td>25.74</td><td></td><td>gladman/1</td></tr>
<tr><td>amd64</td><td>Intel Pentium D (f64); speed</td><td></td><td>27.33</td><td>OpenSSL 0.9.8e</td></tr>
<tr><td>ppc32</td><td>Motorola PowerPC G4 7410; gggg</td><td></td><td>29.32</td><td>OpenSSL 0.9.8c</td></tr>
<tr><td>sparcv9</td><td>Sun UltraSPARC III; nmi0051</td><td>29.45</td><td></td><td>bernstein/big-1/1</td></tr>
<tr><td>sparcv9</td><td>Sun UltraSPARC III; nmisolaris10</td><td>29.46</td><td></td><td>bernstein/big-1/1</td></tr>
<tr><td>ppc64</td><td>IBM Cell PPE; nmips3</td><td>35.20</td><td></td><td>bernstein/big-1/1</td></tr>
<tr><td>amd64</td><td>Intel Pentium 4 (f64)</td><td></td><td>37</td><td>OpenSSL 0.9.7f</td></tr>
<tr><td>x86</td><td>Intel Pentium 4 (f29)</td><td></td><td>39</td><td>OpenSSL 0.9.7e</td></tr>
<tr><td>sparc</td><td>Sun UltraSPARC III</td><td></td><td>46.875 (750/block)</td><td>Bassham</td></tr>
<tr><td>x86</td><td>Intel Pentium 1 (52c); cruncher</td><td>38.20</td><td></td><td>hongjun/v1/1</td></tr>
</tbody></table>
<p>
Regarding amd64 Intel Pentium 4,
Matsui writes:
"The number of memory reads
for one block encryption of AES
is 4 (for plaintext loads)
+ 11 x 4 (for subkey loads)
+ 16 x 10 (for table lookups)
= 208,
which means that Pentium 4 takes at least 208 cycles/block for one block encryption."
But this lower bound ignores the possibility of loading partially expanded keys,
saving as many as 30 loads,
and using 64-bit loads for keys and plaintext,
saving 9 more loads.
</p><p>
Regarding amd64 AMD Athlon 64,
Matsui writes:
"Considering an instruction latency of Athlon 64, the theoretical limit of AES
performance on this processor seems around 16 cycles/round = 160 cycles/block.
Our result is hence reaching closely this limit."
</p></body></html>
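
The load counts quoted above can be checked with a few lines of arithmetic. The Python sketch below reproduces Matsui's 208-load figure and the reduced count once both savings described on the page are applied; the variable names are illustrative and do not come from the page.

```python
# Memory reads for one AES block, following Matsui's accounting as quoted above.
plaintext_loads = 4        # "4 (for plaintext loads)"
subkey_loads = 11 * 4      # "11 x 4 (for subkey loads)"
table_lookups = 16 * 10    # "16 x 10 (for table lookups)"

matsui_bound = plaintext_loads + subkey_loads + table_lookups

# Savings the page describes: partially expanded keys (as many as 30 loads)
# and 64-bit loads for keys and plaintext (9 more loads).
reduced_bound = matsui_bound - 30 - 9

print(matsui_bound, reduced_bound)
```

Under those savings the count can fall as low as 169 loads per block, which is why the page treats 208 as too pessimistic a lower bound.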

28
archive/manifest.yaml Normal file

@ -0,0 +1,28 @@
# archive/manifest.yaml — curated list of works to preserve.
# Edited by hand. Tools never write to this file. See ARCHIVE.md.
#
# Per-artifact cap: 25 MB. Above that, archive.py warns and skips the fetch;
# commit an oversize artifact deliberately with `git add -f`.
#
# To evict an entry, see archive/removed.yaml — record there FIRST, then
# delete the line here, then run `make archive-gc`.
- url: "https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.203.pdf"
  slug: nist-fips-203
  title: "FIPS 203 — Module-Lattice-Based Key-Encapsulation Mechanism Standard"
  type: pdf
  tags: [research]
  note: >
    The ML-KEM standard. Cited in the SIMD / post-quantum systems work;
    archived so the citation survives any future reorganization of the
    NIST publications site.
- url: "https://cr.yp.to/aes-speed.html"
  slug: djb-aes-speed
  title: "AES speed (cr.yp.to)"
  # type: html is auto-detected from the .html extension; no override needed.
  tags: [research]
  note: >
    Bernstein's AES speed page (software timings per CPU), cited in the
    SIMD work. The Phase 2 bootstrap entry: a stable, JavaScript-free
    static page, so its monolith snapshot is reproducible and classifies
    cleanly as `ok`.
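
The manifest comments above describe two per-entry rules: the artifact type is taken from the URL's extension unless `type:` overrides it, and artifacts over 25 MB are skipped with a warning. A minimal Python sketch of those two checks follows. `detect_type`, `within_cap` and `CAP_BYTES` are hypothetical names. Reading 25 MB as 25 × 1024 × 1024 bytes and returning `None` for an unrecognised extension are both assumptions, since the manifest does not say how `archive.py` handles either case.

```python
from urllib.parse import urlparse

CAP_BYTES = 25 * 1024 * 1024  # assumed reading of the 25 MB per-artifact cap


def detect_type(url, override=None):
    """Return 'pdf' or 'html' for a manifest entry, or None if undetermined.

    An explicit `type:` value wins; otherwise the URL path's extension decides.
    """
    if override is not None:
        return override
    path = urlparse(url).path.lower()
    if path.endswith(".pdf"):
        return "pdf"
    if path.endswith((".html", ".htm")):
        return "html"
    return None  # hypothetical fallback; the real tool may behave differently


def within_cap(size_bytes):
    """True when an artifact is at or below the cap, so the fetch proceeds."""
    return size_bytes <= CAP_BYTES
```

With the two entries above, `detect_type` classifies the NIST URL as `pdf` and the cr.yp.to URL as `html` without any override.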


@ -0,0 +1,14 @@
{
"url": "https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.203.pdf",
"slug": "nist-fips-203",
"title": "FIPS 203 — Module-Lattice-Based Key-Encapsulation Mechanism Standard",
"type": "pdf",
"artifact": "document.pdf",
"sha256": "fe1f12f32a7e44ec9fdebbf400cda843a40b506dee676725234dc6f7923b6cac",
"previous-sha256": null,
"bytes": 1252341,
"archived": "2026-05-22",
"source-date": null,
"snapshot-quality": "ok",
"wayback": "http://web.archive.org/web/20260515100505/https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.203.pdf"
}
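
The `sha256` and `artifact` fields above are what the integrity check compares against. The following Python sketch shows one way such a re-hash could look. It is not the code in `tools/archive.py`; `sha256_of` and `verify_entry` are hypothetical names, and returning `False` for a missing artifact is a simplification (the Haskell side halts the build instead).

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 of a file, read in chunks so a large PDF is not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_entry(entry_dir):
    """Compare the artifact named in PROVENANCE.json with its recorded hash."""
    entry_dir = Path(entry_dir)
    prov = json.loads((entry_dir / "PROVENANCE.json").read_text(encoding="utf-8"))
    artifact = entry_dir / prov["artifact"]
    if not artifact.is_file():
        return False  # simplification: the real build treats this as fatal
    return sha256_of(artifact) == prov["sha256"]
```

A caller would run `verify_entry("archive/nist-fips-203")` and refuse to continue on `False`.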

Binary file not shown.

19
archive/removed.yaml Normal file

@ -0,0 +1,19 @@
# archive/removed.yaml — record of evicted archive entries.
#
# Append an entry here BEFORE deleting its line from manifest.yaml, then
# run `make archive-gc`. The GC deletes only archive/<slug>/ directories
# whose slug is recorded here; an orphaned directory absent from this file
# is reported, never deleted. See ARCHIVE.md § Eviction & removal.
#
# Schema (all fields but `note` required):
# url: original URL at time of removal
# slug: the archive/<slug>/ directory archive-gc may delete
# removed: ISO date of removal
# reason: takedown | author-request | legal | quality
# note: optional free-text context
#
# This is not a hostile-tracking list — it exists so GC knows what is safe
# to delete, re-adding a removed URL is surfaced loudly, and the link-rot
# scanner and `archive-suggest` skip removed works.
[]
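
The deletion rule in the header comment (delete only slugs recorded in removed.yaml, report any other orphan) can be sketched as a pure function. This is an illustration in Python, not the `archive-gc` implementation; `plan_gc` and its parameters are hypothetical, and leaving a slug untouched while it is still in the manifest is an assumption about the overlap case.

```python
def plan_gc(on_disk, manifest_slugs, removed_slugs):
    """Split archive/<slug>/ directories into deletable ones and orphans to report.

    on_disk:        slugs that have a directory under archive/
    manifest_slugs: slugs still listed in manifest.yaml
    removed_slugs:  slugs recorded in removed.yaml
    """
    live = set(manifest_slugs)
    removed = set(removed_slugs)
    to_delete, to_report = [], []
    for slug in sorted(on_disk):
        if slug in live:
            continue  # still authored: left alone (assumed for the overlap case)
        if slug in removed:
            to_delete.append(slug)   # recorded eviction: safe to delete
        else:
            to_report.append(slug)   # orphan: reported, never deleted
    return to_delete, to_report
```

For example, with directories `a`, `b`, `c` on disk, `a` in the manifest and `b` in removed.yaml, the plan deletes `b` and reports `c`.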

579
build/Archive.hs Normal file

@ -0,0 +1,579 @@
{-# LANGUAGE GHC2021 #-}
{-# LANGUAGE OverloadedStrings #-}
-- | Archive section — the link-archiving system. Phases 1-2: PDF and HTML.
--
-- Authored input: archive/manifest.yaml (one line per archived link)
-- Generated, committed: archive/<slug>/{document.pdf | snapshot.html}
-- + PROVENANCE.json
-- Generated, gitignored: archive/<slug>/{document,snapshot}.txt
-- + data/archive-index.json
--
-- @tools/archive.py fetch@ runs before the Hakyll build: it downloads
-- PDFs / snapshots HTML pages with @monolith@, extracts text, and writes
-- each PROVENANCE.json. This module then routes the artifacts and renders
-- one @/archive/<slug>/@ page per entry plus the @/archive/@ index.
--
-- An entry that has never been fetched (no PROVENANCE.json) is skipped:
-- it produces no page. An orphaned @archive/<slug>/@ directory with no
-- manifest line is inert (no page, not deployed). A PROVENANCE.json whose
-- artifact file is missing from disk is a hard error and halts the
-- build. Artifact-integrity (SHA-256) verification
-- runs on both sides: @archive.py fetch@ re-hashes before the Hakyll
-- build, and 'verifyArtifactSha' (below) re-hashes again in
-- 'loadArchiveEntries' — so the guarantee holds even when @archive.py@
-- does not run first (no @.venv@, a direct @cabal run site -- build@,
-- or a deploy host without the Python toolchain).
--
-- See @ARCHIVE.md@ at the repo root for the full design and phase plan.
module Archive (archiveRules, archiveBuildStats) where
import Control.Exception (SomeException, catch)
import Control.Monad (filterM, forM, when)
import Data.Function (on)
import Data.List (groupBy, intercalate, sort, sortBy)
import qualified Data.Map.Strict as Map
import Data.Maybe (catMaybes, fromMaybe)
import Data.Ord (Down (..), comparing)
import qualified Data.Set as Set
import qualified Data.Text as T
import Data.Time (Day, diffDays, fromGregorian,
getCurrentTime, utctDay)
import qualified Data.Aeson as A
import Data.Aeson ((.:), (.:?))
import qualified Data.Yaml as Y
import System.Directory (doesDirectoryExist, doesFileExist,
listDirectory)
import System.Exit (exitFailure)
import System.IO (hPutStrLn, readFile', stderr)
import System.Process (readProcess)
import Text.Read (readMaybe)
import Hakyll
import Contexts (siteCtx)
import Backlinks (referencedByField)
import SimilarLinks (similarLinksField)
import ArchiveIndex (ArchiveStatus (..), statusName,
archiveStatusForSlug, normalizeUrl)
-- ---------------------------------------------------------------------------
-- Data model
-- ---------------------------------------------------------------------------
-- | One authored entry in @archive/manifest.yaml@ — only the fields this
-- module consumes. @title:@, @type:@ and @tags:@ are read by
-- @tools/archive.py@ (title and type fold into PROVENANCE.json; tags are
-- Phase 4) and need no Haskell-side binding.
data ManifestEntry = ManifestEntry
{ meUrl :: String
, meNote :: Maybe String
, mePaywalled :: Bool
, meVisibility :: String -- ^ "public" (default) | "private"
}
instance A.FromJSON ManifestEntry where
parseJSON = A.withObject "ManifestEntry" $ \o -> do
url <- o .: "url"
note <- o .:? "note"
paywalled <- fromMaybe False <$> o .:? "paywalled"
visibility <- fromMaybe "public" <$> o .:? "visibility"
-- A publication/privacy field must fail closed: an unknown value
-- (e.g. a typo'd "privte") would otherwise be treated as public
-- and publish an artifact the author intended to keep offline.
when (visibility `notElem` ["public", "private"]) $ fail $
"manifest entry " ++ url
++ ": visibility must be \"public\" or \"private\", got "
++ show visibility
return (ManifestEntry url note paywalled visibility)
newtype RemovedEntry = RemovedEntry { reUrl :: String }
instance A.FromJSON RemovedEntry where
parseJSON = A.withObject "RemovedEntry" $ \o ->
RemovedEntry <$> o .: "url"
-- | One generated @archive/<slug>/PROVENANCE.json@ — the immutable
-- record of an archival event, written by @tools/archive.py@.
data Provenance = Provenance
{ pvUrl :: String
, pvSlug :: String
, pvTitle :: String
, pvType :: String -- ^ "pdf" | "html"
, pvArtifact :: String -- ^ "document.pdf" | "snapshot.html"
, pvSha256 :: String
, pvBytes :: Integer
, pvArchived :: String
, pvQuality :: String -- ^ "ok" | "degraded" | "js-required"
, pvWayback :: Maybe String
}
instance A.FromJSON Provenance where
parseJSON = A.withObject "Provenance" $ \o -> Provenance
<$> o .: "url"
<*> o .: "slug"
<*> o .: "title"
<*> o .: "type"
<*> o .: "artifact"
<*> o .: "sha256"
<*> o .: "bytes"
<*> o .: "archived"
<*> (fromMaybe "ok" <$> o .:? "snapshot-quality")
<*> o .:? "wayback"
-- | A renderable archive entry: the authored manifest line joined with
-- its generated provenance and extracted full text. @aeTextId@ is the
-- on-disk path of the extracted-text sidecar when it exists (it is
-- gitignored, so a no-@.venv@ build may lack it).
data ArchiveEntry = ArchiveEntry
{ aeManifest :: ManifestEntry
, aeProv :: Provenance
, aeFulltext :: String
, aeTextId :: Maybe FilePath
, aeStatus :: ArchiveStatus -- ^ link-rot status of the original
}
-- | The extracted-text sidecar name for an artifact type.
textFileFor :: Provenance -> String
textFileFor pv
| pvType pv == "html" = "snapshot.txt"
| otherwise = "document.txt"
-- | True for a @visibility: private@ entry — kept in-repo as a local
-- preservation copy, but its artifact is never routed to @_site/@ and
-- its extracted text is never rendered into the page.
isPrivate :: ArchiveEntry -> Bool
isPrivate = (== "private") . meVisibility . aeManifest
-- ---------------------------------------------------------------------------
-- Rule-generation-time IO (runs inside 'preprocess')
-- ---------------------------------------------------------------------------
manifestPath, removedPath :: FilePath
manifestPath = "archive/manifest.yaml"
removedPath = "archive/removed.yaml"
-- | Read @archive/manifest.yaml@. An absent file yields an empty list
-- (the archive degrades to invisible, matching the @.venv@-gated
-- silent-skip convention). A *parse error on a present file* halts the
-- build: the file exists but is broken — degrading to invisible would
-- swallow real errors like a typo'd @visibility@ value or a malformed
-- entry, both of which are publication-relevant.
readManifest :: IO [ManifestEntry]
readManifest = do
exists <- doesFileExist manifestPath
if not exists
then return []
else do
parsed <- Y.decodeFileEither manifestPath
case parsed of
Right es -> return es
Left e -> do
hPutStrLn stderr $
"[archive] FATAL: manifest.yaml: " ++ show e
exitFailure
-- | Read @archive/removed.yaml@ into a set of normalised URLs. An absent
-- file yields the empty set; a parse error on a present file halts the
-- build, as in 'readManifest'.
readRemovedUrls :: IO (Set.Set T.Text)
readRemovedUrls = do
exists <- doesFileExist removedPath
if not exists
then return Set.empty
else do
parsed <- Y.decodeFileEither removedPath
case parsed of
Right entries -> return . Set.fromList $
map (normalizeUrl . T.pack . reUrl) (entries :: [RemovedEntry])
Left e -> do
hPutStrLn stderr $
"[archive] FATAL: removed.yaml: " ++ show e
exitFailure
-- | Halt the build if a manifest URL is also recorded in @removed.yaml@,
-- or if two manifest URLs normalise to the same archive target.
validateManifestEntries :: [ManifestEntry] -> Set.Set T.Text -> IO ()
validateManifestEntries manifest removed = go Map.empty manifest
where
go _ [] = return ()
go seen (entry : rest) = do
let url = meUrl entry
norm = normalizeUrl (T.pack url)
when (norm `Set.member` removed) $ do
hPutStrLn stderr $
"[archive] FATAL: manifest URL " ++ show url
++ " is also recorded in removed.yaml; refusing to publish "
++ "a deliberately removed work."
exitFailure
case Map.lookup norm seen of
Just prior -> do
hPutStrLn stderr $
"[archive] FATAL: manifest URLs " ++ show prior ++ " and "
++ show url ++ " normalise to the same archive target."
exitFailure
Nothing -> go (Map.insert norm url seen) rest
-- | Scan @archive/<slug>/PROVENANCE.json@ into a @url -> (slug, Provenance)@
-- map. The directory name is the slug; the join key is the URL.
readProvenances :: IO (Map.Map String (String, Provenance))
readProvenances = do
exists <- doesDirectoryExist "archive"
if not exists
then return Map.empty
else do
names <- listDirectory "archive"
entries <- forM names $ \name -> do
let provPath = "archive/" ++ name ++ "/PROVENANCE.json"
isFile <- doesFileExist provPath
if not isFile
then return Nothing
else do
decoded <- A.eitherDecodeFileStrict' provPath
case decoded of
Right p -> return (Just (pvUrl p, (name, p)))
Left e -> do
hPutStrLn stderr $
"[archive] FATAL: " ++ provPath ++ ": " ++ show e
exitFailure
return (Map.fromList (catMaybes entries))
-- | Read a file, returning "" on any error (e.g. an absent text sidecar).
readFileSafe :: FilePath -> IO String
readFileSafe path =
catch (readFile' path) (\(_ :: SomeException) -> return "")
-- | Verify a committed artifact's SHA-256 against its recorded value.
-- The build halts with a clear message on mismatch — so the integrity
-- guarantee holds even when @tools/archive.py@ does not run first
-- (e.g. no @.venv@, or a direct @cabal run site -- build@), and a
-- tampered or corrupted artifact can never be deployed.
--
-- Shells out to @sha256sum@ (GNU coreutils — same toolchain the rest of
-- the build assumes); a missing or non-zero @sha256sum@ surfaces as an
-- exception that also halts the build.
verifyArtifactSha :: String -> FilePath -> String -> IO ()
verifyArtifactSha slug path expected = do
out <- readProcess "sha256sum" [path] ""
let actual = takeWhile (/= ' ') out
when (actual /= expected) $ do
hPutStrLn stderr $
"[archive] FATAL: " ++ slug ++ ": " ++ path
++ " SHA-256 mismatch (recorded " ++ expected
++ ", found " ++ actual
++ "). The committed artifact is corrupt or was replaced; "
++ "halting build."
exitFailure
-- | Join the authored manifest with generated provenance. A manifest
-- entry with no matching provenance is dropped, so it produces no page.
-- A provenance record whose artifact is missing from disk, or whose
-- SHA-256 does not match the recorded value, halts the build.
loadArchiveEntries :: IO [ArchiveEntry]
loadArchiveEntries = do
manifest <- readManifest
removed <- readRemovedUrls
validateManifestEntries manifest removed
provByUrl <- readProvenances
fmap catMaybes $ forM manifest $ \me ->
case Map.lookup (meUrl me) provByUrl of
Nothing -> return Nothing
Just (slug, pv) -> do
let dir = "archive/" ++ slug
txtPath = dir ++ "/" ++ textFileFor pv
let artPath = dir ++ "/" ++ pvArtifact pv
artifactThere <- doesFileExist artPath
if not artifactThere
then do
hPutStrLn stderr $
"[archive] FATAL: " ++ slug ++ ": " ++ artPath
++ " is missing although PROVENANCE.json exists; "
++ "restore the committed artifact before building."
exitFailure
else do
verifyArtifactSha slug artPath (pvSha256 pv)
txtThere <- doesFileExist txtPath
txt <- if txtThere then readFileSafe txtPath
else return ""
return $ Just ArchiveEntry
{ aeManifest = me
, aeProv = pv
, aeFulltext = txt
, aeTextId = if txtThere then Just txtPath
else Nothing
, aeStatus = archiveStatusForSlug slug
}
-- ---------------------------------------------------------------------------
-- Rules
-- ---------------------------------------------------------------------------
-- | All archive rules. Called once from 'Site.rules'.
archiveRules :: Rules ()
archiveRules = do
entries <- preprocess loadArchiveEntries
-- Raw artifacts: the PDF / HTML snapshot of every *public* entry,
-- served at its own path (/archive/<slug>/...). Routing this explicit
-- list rather than a glob means a `visibility: private` entry's
-- artifact is never deployed, and an orphan directory's artifact
-- (no manifest line) is not deployed either.
let publicArtifacts =
[ fromFilePath ("archive/" ++ pvSlug (aeProv e)
++ "/" ++ pvArtifact (aeProv e))
| e <- entries, not (isPrivate e) ]
match (fromList publicArtifacts) $ do
route idRoute
compile copyFileCompiler
-- Provenance, extracted text, and the manifest: matched (not routed)
-- so the generated pages can `load` them as dependencies and recompile
-- when they change.
match "archive/*/PROVENANCE.json" $ compile getResourceBody
match "archive/*/document.txt" $ compile getResourceBody
match "archive/*/snapshot.txt" $ compile getResourceBody
match "archive/manifest.yaml" $ compile getResourceBody
mapM_ archiveEntryRule entries
archiveIndexRule entries
-- | One @/archive/<slug>/@ page.
archiveEntryRule :: ArchiveEntry -> Rules ()
archiveEntryRule ae =
create [fromFilePath ("archive/" ++ slug ++ "/index.html")] $ do
route idRoute
compile $ do
-- Dependency edges: recompile when provenance or the manifest
-- changes. The extracted-text sidecar is gitignored and may be
-- absent (no .venv / fetch never ran); load it as a dependency
-- only when present, so the build never fails for a missing
-- generated file.
_ <- load provId :: Compiler (Item String)
_ <- load manifestId :: Compiler (Item String)
case aeTextId ae of
Just tp -> do
_ <- load (fromFilePath tp) :: Compiler (Item String)
return ()
Nothing -> return ()
makeItem ""
>>= loadAndApplyTemplate "templates/archive.html" ctx
>>= loadAndApplyTemplate "templates/default.html" ctx
>>= relativizeUrls
where
slug = pvSlug (aeProv ae)
provId = fromFilePath ("archive/" ++ slug ++ "/PROVENANCE.json")
manifestId = fromFilePath manifestPath
ctx = archiveEntryCtx ae
-- | The @/archive/@ index — every archived work, newest snapshot first.
archiveIndexRule :: [ArchiveEntry] -> Rules ()
archiveIndexRule entries =
create ["archive/index.html"] $ do
route idRoute
compile $ do
-- Recompile when any provenance appears / changes, or the
-- manifest changes.
_ <- loadAll "archive/*/PROVENANCE.json" :: Compiler [Item String]
_ <- load (fromFilePath manifestPath) :: Compiler (Item String)
let sorted = sortBy (comparing (Down . pvArchived . aeProv)) entries
items = map (\e -> Item (fromFilePath ("archive/" ++ pvSlug (aeProv e))) e)
sorted
ctx = listField "entries" entryListCtx (return items)
<> constField "title" "Archive"
<> constField "archive" "true"
<> constField "noindex" "true"
<> (if null entries then mempty
else constField "has-entries" "true")
<> siteCtx
makeItem ""
>>= loadAndApplyTemplate "templates/archive-index.html" ctx
>>= loadAndApplyTemplate "templates/default.html" ctx
>>= relativizeUrls
-- ---------------------------------------------------------------------------
-- Contexts
-- ---------------------------------------------------------------------------
-- | Per-entry context for the @/archive/<slug>/@ page.
archiveEntryCtx :: ArchiveEntry -> Context String
archiveEntryCtx ae = mconcat
[ constField "title" (pvTitle pv)
, constField "archive" "true"
, constField "noindex" "true"
, constField "original-url" (meUrl me)
, constField "archived" (pvArchived pv)
, constField "archive-type" (pvType pv)
, constField "sha-short" (take 12 (pvSha256 pv))
, constField "size" (formatBytes (pvBytes pv))
, constField "snapshot-quality" (pvQuality pv)
, constField "status" (statusName (aeStatus ae))
, qualityFlag
, maybeField "status-note" (statusNote (aeStatus ae))
, maybeField "note" (meNote me)
, maybeField "wayback" (pvWayback pv)
, maybeField "paywalled" (if mePaywalled me then Just "true" else Nothing)
, visibilityFields
-- "Referenced by" (the pages that cite this work) and "Related"
-- (semantically near content). Both resolve by this page's route, so
-- they need no archive-specific wiring; each is a $if(...)$-guarded
-- section in archive.html.
, referencedByField
, similarLinksField
, siteCtx
]
where
me = aeManifest ae
pv = aeProv ae
slug = pvSlug pv
artUrl = "/archive/" ++ slug ++ "/" ++ pvArtifact pv
-- A non-'ok' snapshot raises a visible flag on the page.
qualityFlag
| pvQuality pv == "ok" = mempty
| otherwise = constField "degraded" "true"
-- A private entry keeps a local preservation copy but publishes none
-- of it: no embed, no extracted text — only the provenance metadata
-- and a 'held offline' note. A public entry embeds the artifact raw
-- (the browser renders the PDF natively, the snapshot loads directly;
-- no PDF.js wrapper) and renders its extracted text into the page.
-- The is-pdf / is-html flag drives only the iframe sandbox: a
-- third-party HTML snapshot is sandboxed, our own committed PDF is not.
visibilityFields
| isPrivate ae = constField "private" "true"
| otherwise = typeField
<> constField "artifact-url" artUrl
<> constField "artifact-name" (pvArtifact pv)
<> fulltextField (pvType pv) (aeFulltext ae)
typeField
| pvType pv == "html" = constField "is-html" "true"
| otherwise = constField "is-pdf" "true"
-- | Renders the extracted full text into the page DOM so embed.py and
-- Pagefind index real text, not an opaque iframe. PDF text keeps its
-- pdftotext layout in a @<pre>@; HTML text is block-separated prose, so
-- it renders as escaped @<p>@ paragraphs. Absent when the text is empty
-- / whitespace, so the @$if(fulltext)$@ guard hides the section.
fulltextField :: String -> String -> Context String
fulltextField ftype txt
| all isBlank txt = mempty
| ftype == "html" = constField "fulltext" (htmlParagraphs txt)
| otherwise = constField "fulltext" preBlock
where
isBlank c = c == ' ' || c == '\n' || c == '\t' || c == '\r'
preBlock = "<pre class=\"archive-fulltext\">"
++ escapeHtml txt ++ "</pre>"
-- | Block-separated text (paragraphs delimited by blank lines, as
-- @archive.py@'s HTML extractor writes it) → escaped @<p>@ elements.
htmlParagraphs :: String -> String
htmlParagraphs = concatMap para . paragraphsOf
where
para p = "<p>" ++ escapeHtml p ++ "</p>\n"
paragraphsOf = map (unwords . concatMap words)
. filter (not . blankGroup)
. groupBy ((==) `on` blankLine)
. lines
blankGroup g = null g || blankLine (head g)
blankLine = all (`elem` (" \t\r" :: String))
-- | List-item context for the @/archive/@ index.
entryListCtx :: Context ArchiveEntry
entryListCtx = mconcat
[ field "entry-title" (return . pvTitle . aeProv . itemBody)
, field "entry-archived" (return . pvArchived . aeProv . itemBody)
, field "entry-type" (return . pvType . aeProv . itemBody)
, field "entry-quality" (return . pvQuality . aeProv . itemBody)
, boolField "entry-degraded" ((/= "ok") . pvQuality . aeProv . itemBody)
, boolField "entry-private" (isPrivate . itemBody)
, field "entry-status" (return . statusName . aeStatus . itemBody)
, boolField "entry-rotted" ((== Rotted) . aeStatus . itemBody)
, field "entry-url" (\i -> return $
"/archive/" ++ pvSlug (aeProv (itemBody i)) ++ "/")
]
-- | Provide a field only when the value is present; otherwise contribute
-- nothing, so the template's @$if(...)$@ guard is false.
maybeField :: String -> Maybe String -> Context String
maybeField k = maybe mempty (constField k)
-- | A prose note for a non-live link-rot status, shown on the archive
-- page; 'Nothing' for 'Live' / 'Error' (no note rendered).
statusNote :: ArchiveStatus -> Maybe String
statusNote Rotted = Just "The original is no longer reachable. This archived \
\copy is now the live link."
statusNote Moved = Just "The original page has moved since this snapshot was \
\taken; the link above may redirect."
statusNote _ = Nothing
-- ---------------------------------------------------------------------------
-- Formatting
-- ---------------------------------------------------------------------------
-- | Human-readable byte count (mirrors the helper in build/Stats.hs).
formatBytes :: Integer -> String
formatBytes b
| b < 1024 = show b ++ " B"
| b < 1024 * 1024 = showD (b * 10 `div` 1024) ++ " KB"
| otherwise = showD (b * 10 `div` (1024 * 1024)) ++ " MB"
where
showD n = show (n `div` 10) ++ "." ++ show (n `mod` 10)
-- ---------------------------------------------------------------------------
-- /build/ telemetry
-- ---------------------------------------------------------------------------
-- | Archive metrics for the @/build/@ telemetry page — count, total size,
-- median artifact age, breakdowns by link-rot status / snapshot quality
-- / visibility, the paywalled count, and any orphan directories.
-- Rendered by @Stats.hs@; an empty archive yields just the count.
archiveBuildStats :: IO [(String, String)]
archiveBuildStats = do
entries <- loadArchiveEntries
today <- utctDay <$> getCurrentTime
orphans <- findOrphanDirs entries
let n = length entries
bytes = sum (map (pvBytes . aeProv) entries)
ages = [ fromInteger (diffDays today d)
| e <- entries
, Just d <- [parseIsoDay (pvArchived (aeProv e))] ]
paywalled = length (filter (mePaywalled . aeManifest) entries)
return $
[ ("Entries", show n) ]
++ (if n == 0 then [] else
[ ("Total size", formatBytes bytes)
, ("Median age", medianAge ages)
, ("By status", tallyOf (map (statusName . aeStatus) entries))
, ("By quality", tallyOf (map (pvQuality . aeProv) entries))
, ("By visibility", tallyOf (map (meVisibility . aeManifest) entries))
])
++ [ ("Paywalled", show paywalled) | paywalled > 0 ]
++ [ ("Orphan directories", unwords orphans) | not (null orphans) ]
-- | Directory names under @archive/@ that hold a @PROVENANCE.json@ but are
-- not a live manifest entry — drift the @/build/@ page should surface.
findOrphanDirs :: [ArchiveEntry] -> IO [String]
findOrphanDirs entries = do
exists <- doesDirectoryExist "archive"
if not exists
then return []
else do
names <- listDirectory "archive"
let live = map (pvSlug . aeProv) entries
filterM
(\name -> do
hasProv <- doesFileExist
("archive/" ++ name ++ "/PROVENANCE.json")
return (hasProv && name `notElem` live))
(sort names)
-- | Format a multiset of string values as @"a 2 \183 b 1"@.
tallyOf :: [String] -> String
tallyOf xs = intercalate " \183 "
[ k ++ " " ++ show c
| (k, c) <- Map.toList (Map.fromListWith (+) [ (x, 1 :: Int) | x <- xs ]) ]
-- | The median of a list of ages, as @"N days"@; an em dash when empty.
-- For an even count this is the upper of the two middle values.
medianAge :: [Int] -> String
medianAge [] = "\8212"
medianAge xs =
let m = sort xs !! (length xs `div` 2)
in show m ++ if m == 1 then " day" else " days"
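-- | Parse a @YYYY-MM-DD@ date; 'Nothing' when a field is missing or
-- non-numeric. Out-of-range month or day values are clipped by
-- 'fromGregorian' rather than rejected.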
parseIsoDay :: String -> Maybe Day
parseIsoDay s = case splitOnDash s of
[y, m, d] -> fromGregorian <$> readMaybe y <*> readMaybe m <*> readMaybe d
_ -> Nothing
where
splitOnDash str = case break (== '-') str of
(a, '-' : rest) -> a : splitOnDash rest
(a, _) -> [a]
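
The two formatting helpers above truncate rather than round, which is easy to miss. The Python port below is only an illustration of that arithmetic, not code from the repository. It mirrors `formatBytes` and `medianAge` so the behaviour can be checked by hand: the 1,252,341-byte FIPS 203 artifact renders as "1.1 MB", and a two-element age list yields its larger element.

```python
def format_bytes(b):
    """Mirror of formatBytes: one decimal place, truncated rather than rounded."""
    if b < 1024:
        return f"{b} B"
    if b < 1024 * 1024:
        tenths, unit = b * 10 // 1024, "KB"
    else:
        tenths, unit = b * 10 // (1024 * 1024), "MB"
    return f"{tenths // 10}.{tenths % 10} {unit}"


def median_age(ages):
    """Mirror of medianAge: the element at index len // 2 of the sorted list."""
    if not ages:
        return "\u2014"  # same placeholder character as the Haskell "\8212"
    m = sorted(ages)[len(ages) // 2]
    return f"{m} day" if m == 1 else f"{m} days"


print(format_bytes(1252341))   # the FIPS 203 artifact size from PROVENANCE.json
print(median_age([3, 1]))
```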

255
build/ArchiveIndex.hs Normal file

@ -0,0 +1,255 @@
{-# LANGUAGE GHC2021 #-}
{-# LANGUAGE OverloadedStrings #-}
-- | ArchiveIndex — shared read-only access to the archive's two JSON
-- sidecars: @data/archive-index.json@ (the @url\/alias -> slug@ map
-- written by @archive.py fetch@) and @data/archive-state.json@ (the
-- per-URL link-rot status written by @archive.py check@).
--
-- Consumers:
--
-- * @Filters.Archive@ — appends the archive affordance to body links
-- whose target is archived, and flips a @rotted@ link to the local
-- copy.
-- * @Backlinks@ — keeps archived external links through pass 1 and
-- canonicalises them to their @/archive/<slug>/@ page in pass 2.
-- * @Archive@ — surfaces each entry's rot status on its page, the
-- @/archive/@ index, and the @/build/@ telemetry.
--
-- Both files are loaded once per build via @unsafePerformIO@ CAFs. An
-- absent or malformed file degrades safely: an empty index makes the
-- link consumers no-op; an absent state file makes every entry @Live@
-- (the safe default — no link flip). @archive.py check@ is decoupled
-- from @make build@; a build consumes whatever state file exists.
module ArchiveIndex
( ArchiveStatus (..)
, statusName
, archiveSlugFor
, archiveStatusForSlug
, archiveIndexIsEmpty
, normalizeUrl
) where
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import Data.Maybe (fromMaybe)
import Data.Set (Set)
import qualified Data.Set as Set
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Aeson as A
import Data.Aeson ((.!=), (.:), (.:?))
import qualified Data.Yaml as Y
import System.Directory (doesFileExist)
import System.IO.Unsafe (unsafePerformIO)
-- ---------------------------------------------------------------------------
-- Link-rot status
-- ---------------------------------------------------------------------------
-- | The link-rot status of an archived work's original URL, as set by
-- @archive.py check@. 'Live' is the safe default for an unscanned or
-- unknown entry.
data ArchiveStatus = Live | Moved | Rotted | Error
deriving (Eq, Show)
-- | The lower-case wire name, matching @archive-state.json@ and the
-- @status:@ Pagefind filter tag.
statusName :: ArchiveStatus -> String
statusName Live = "live"
statusName Moved = "moved"
statusName Rotted = "rotted"
statusName Error = "error"
-- | Parse a wire name from @archive-state.json@. Any unrecognised value
-- falls back to 'Live', the safe default (no link flip).
parseStatus :: Text -> ArchiveStatus
parseStatus "moved" = Moved
parseStatus "rotted" = Rotted
parseStatus "error" = Error
parseStatus _ = Live
-- ---------------------------------------------------------------------------
-- JSON shapes
-- ---------------------------------------------------------------------------
-- | One @archive-index.json@ entry. Only @slug@ and @aliases@ are used.
data IdxEntry = IdxEntry
{ ieSlug :: String
, ieAliases :: [Text]
}
instance A.FromJSON IdxEntry where
parseJSON = A.withObject "IdxEntry" $ \o -> IdxEntry
<$> o .: "slug"
<*> (o .:? "aliases" .!= [])
-- | One @archive-state.json@ entry — only the @status@ is consumed here.
newtype StateEntry = StateEntry { seStatus :: ArchiveStatus }
instance A.FromJSON StateEntry where
parseJSON = A.withObject "StateEntry" $ \o ->
StateEntry . parseStatus <$> (o .:? "status" .!= "live")
newtype UrlEntry = UrlEntry { ueUrl :: Text }
instance A.FromJSON UrlEntry where
parseJSON = A.withObject "UrlEntry" $ \o ->
UrlEntry <$> o .: "url"
-- ---------------------------------------------------------------------------
-- Loaded-once CAFs
-- ---------------------------------------------------------------------------
indexPath, statePath, manifestPath, removedPath :: FilePath
indexPath = "data/archive-index.json"
statePath = "data/archive-state.json"
manifestPath = "archive/manifest.yaml"
removedPath = "archive/removed.yaml"
-- | Read the @url:@ field of every entry in a YAML list file into a set of
-- normalised URLs. An absent file yields the empty set; a malformed file
-- raises an 'IOError'.
readUrlSet :: FilePath -> IO (Set Text)
readUrlSet path = do
exists <- doesFileExist path
if not exists
then return Set.empty
else do
decoded <- Y.decodeFileEither path
case decoded of
Right entries -> return . Set.fromList $
map (normalizeUrl . ueUrl) (entries :: [UrlEntry])
Left e -> ioError . userError $
"[archive] FATAL: " ++ path ++ ": " ++ show e
-- | Canonical URLs still permitted to participate in link annotation.
-- Filtering the generated index at build time makes a direct Hakyll build
-- respect authored manifest/removal state even when archive.py did not run.
{-# NOINLINE activeUrls #-}
activeUrls :: Set Text
activeUrls = unsafePerformIO $ do
manifest <- readUrlSet manifestPath
removed <- readUrlSet removedPath
return (manifest `Set.difference` removed)
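-- | @canonical-url -> entry@. Absent/malformed file -> empty; entries no
-- longer permitted by the authored manifest/removal state are removed.
-- The existence check is required: @A.eitherDecodeFileStrict'@ reports a
-- parse failure as 'Left' but throws an 'IOException' for a missing file.
{-# NOINLINE rawIndex #-}
rawIndex :: Map Text IdxEntry
rawIndex = unsafePerformIO $ do
    exists <- doesFileExist indexPath
    decoded <- if exists
        then A.eitherDecodeFileStrict' indexPath
        else return (Left "absent")
    let parsed = either (const Map.empty) id decoded
    return $ Map.filterWithKey
        (\canon _ -> normalizeUrl canon `Set.member` activeUrls)
        parsed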
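-- | @url -> status@. Absent/malformed file -> empty (every entry 'Live').
-- The file's existence is checked first because
-- @A.eitherDecodeFileStrict'@ throws on a missing file.
{-# NOINLINE rawState #-}
rawState :: Map Text ArchiveStatus
rawState = unsafePerformIO $ do
    exists <- doesFileExist statePath
    if not exists
      then return Map.empty
      else do
        decoded <- A.eitherDecodeFileStrict' statePath
        return $ either (const Map.empty) (Map.map seStatus) decoded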
-- | @normalised-url -> slug@: the canonical key and every alias from
-- @archive-index.json@, each fed through 'normalizeUrl'. Both keys and
-- lookups are normalised, so a citation form the alias set cannot
-- enumerate (e.g. an unbounded arXiv version, or any tracking-laden
-- variant of a clean manifest URL) still resolves.
{-# NOINLINE flatIndex #-}
flatIndex :: Map Text String
flatIndex = Map.fromList
[ (normalizeUrl key, ieSlug e)
| (canon, e) <- Map.toList rawIndex
, key <- canon : ieAliases e
]
-- | @slug -> status@: each entry's status, looked up by its canonical URL
-- in the state file (the two files share the manifest URL as key).
{-# NOINLINE slugStatus #-}
slugStatus :: Map String ArchiveStatus
slugStatus = Map.fromList
[ (ieSlug e, Map.findWithDefault Live canon rawState)
| (canon, e) <- Map.toList rawIndex
]
-- ---------------------------------------------------------------------------
-- Public lookups
-- ---------------------------------------------------------------------------
-- | True when no archive index is available — the link consumers no-op.
archiveIndexIsEmpty :: Bool
archiveIndexIsEmpty = Map.null rawIndex
-- | The archive slug for an outbound URL, or 'Nothing'. Both the index
-- keys and the input go through 'normalizeUrl', so a citation form that
-- the alias set cannot enumerate — an unbounded arXiv version, or any
-- tracking-laden variant of a clean manifest URL — still resolves.
archiveSlugFor :: Text -> Maybe String
archiveSlugFor url = Map.lookup (normalizeUrl url) flatIndex
-- | The link-rot status of an archived entry, by slug. 'Live' for an
-- unknown slug or when no scan has run.
archiveStatusForSlug :: String -> ArchiveStatus
archiveStatusForSlug slug = Map.findWithDefault Live slug slugStatus
-- ---------------------------------------------------------------------------
-- URL normalisation (matching, not display)
-- ---------------------------------------------------------------------------
-- | Tracking-only query parameters: their presence or absence is
-- semantically irrelevant; the lookup strips them before matching.
-- Sync with @TRACKING_PARAMS@ in @tools/archive.py@.
trackingParams :: [Text]
trackingParams =
[ "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"
, "fbclid", "gclid", "mc_eid", "mc_cid", "ref", "igshid"
, "_hsenc", "_hsmi", "mkt_tok"
]
-- | Remove tracking-only query parameters; preserve every other parameter
-- in its original order.
stripTracking :: Text -> Text
stripTracking url = case T.breakOn "?" url of
(_, "") -> url
(path, q) ->
let kept = filter notTracking (T.splitOn "&" (T.drop 1 q))
in if null kept then path
else path <> "?" <> T.intercalate "&" kept
where
notTracking p = T.takeWhile (/= '=') p `notElem` trackingParams
-- | The canonical form of an arXiv URL: @https://arxiv.org/abs/<id>@ with
-- no version suffix and no @.pdf@. Maps every member of the
-- abs/pdf/versioned/@.pdf@ family to the same key. Non-arXiv passes through.
arxivCanonical :: Text -> Text
arxivCanonical url
| Just rest <- T.stripPrefix "https://arxiv.org/" url
, Just key <- arxivKey rest = key
| Just rest <- T.stripPrefix "http://arxiv.org/" url
, Just key <- arxivKey rest = key
| otherwise = url
where
arxivKey rest = case T.breakOn "/" rest of
(kind, slashId)
| kind `elem` ["abs", "pdf"], not (T.null slashId) ->
Just $ "https://arxiv.org/abs/"
<> stripVer (stripPdfSuf (T.tail slashId))
_ -> Nothing
stripPdfSuf t = fromMaybe t (T.stripSuffix ".pdf" t)
stripVer t = case T.breakOnEnd "v" t of
(before, ver)
| not (T.null before)
, not (T.null ver)
, T.all isAsciiDigit ver
-> T.dropEnd 1 before
_ -> t
isAsciiDigit c = c >= '0' && c <= '9'
-- | The full normalisation: drop fragment, strip tracking, fold
-- @http://@→@https://@, arXiv-canonicalise, trim a trailing slash. Both
-- 'flatIndex' keys and 'archiveSlugFor' inputs go through this so the
-- index never misses a citation form the design promises to match.
normalizeUrl :: Text -> Text
normalizeUrl url =
let noFrag = T.takeWhile (/= '#') url
clean = stripTracking noFrag
https = case T.stripPrefix "http://" clean of
Just rest -> "https://" <> rest
Nothing -> clean
arxiv = arxivCanonical https
in T.dropWhileEnd (== '/') arxiv
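
Reviewer sketch (not part of the patch): a standalone restatement of the normalisation pipeline above, useful for checking by hand which citation forms collapse to the same key. The abbreviated `tracking` list and the example URLs are assumptions made for illustration; the real list is `trackingParams`. Because `normalizeUrl` folds `http://` to `https://` before arXiv canonicalisation, this sketch handles only the `https://` prefix in `arxivCanonical`.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import           Data.Char  (isDigit)
import           Data.Maybe (fromMaybe)
import           Data.Text  (Text)
import qualified Data.Text  as T

-- Abbreviated tracking list, for this sketch only.
tracking :: [Text]
tracking = ["utm_source", "utm_medium", "fbclid", "ref"]

-- Drop tracking-only query parameters; keep the rest in order.
stripTracking :: Text -> Text
stripTracking url = case T.breakOn "?" url of
  (_, "")   -> url
  (path, q) ->
    let kept = filter keep (T.splitOn "&" (T.drop 1 q))
    in if null kept then path else path <> "?" <> T.intercalate "&" kept
  where
    keep p = T.takeWhile (/= '=') p `notElem` tracking

-- Collapse abs/pdf/versioned arXiv URLs onto https://arxiv.org/abs/<id>.
arxivCanonical :: Text -> Text
arxivCanonical url = fromMaybe url $ do
  rest <- T.stripPrefix "https://arxiv.org/" url
  let (kind, slashId) = T.breakOn "/" rest
  if kind `elem` ["abs", "pdf"] && not (T.null slashId)
    then Just ("https://arxiv.org/abs/" <> stripVer (stripPdf (T.drop 1 slashId)))
    else Nothing
  where
    stripPdf t = fromMaybe t (T.stripSuffix ".pdf" t)
    stripVer t = case T.breakOnEnd "v" t of
      (before, ver)
        | not (T.null before), not (T.null ver), T.all isDigit ver
        -> T.dropEnd 1 before
      _ -> t

-- Fragment, tracking, scheme, arXiv family, trailing slash: in that order.
normalizeUrl :: Text -> Text
normalizeUrl url =
  let noFrag = T.takeWhile (/= '#') url
      clean  = stripTracking noFrag
      https  = maybe clean ("https://" <>) (T.stripPrefix "http://" clean)
  in T.dropWhileEnd (== '/') (arxivCanonical https)

main :: IO ()
main = mapM_ (print . normalizeUrl)
  [ "http://arxiv.org/pdf/2301.00001v3.pdf#page=4" -- "https://arxiv.org/abs/2301.00001"
  , "https://example.org/a?id=7&fbclid=abc#top"    -- "https://example.org/a?id=7"
  , "https://example.org/post/?utm_source=x"       -- "https://example.org/post"
  ]
```

The third case shows why the trailing-slash trim runs last: the slash only becomes trailing once the tracking-only query has been removed.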

@ -25,9 +25,11 @@
module Backlinks
( backlinkRules
, backlinksField
, referencedByField
) where
import Data.List (nubBy, sortBy)
import Data.List (nubBy, partition, sortBy,
stripPrefix)
import Data.Ord (comparing)
import Data.Maybe (fromMaybe)
import qualified Data.Map.Strict as Map
@ -50,6 +52,7 @@ import Hakyll
import Compilers (readerOpts, writerOpts)
import Filters (preprocessSource)
import qualified Patterns as P
import ArchiveIndex (archiveSlugFor)
-- ---------------------------------------------------------------------------
-- Link-with-context entry (intermediate, saved by the "links" pass)
@ -85,6 +88,7 @@ data BacklinkSource = BacklinkSource
, blAbstract :: String
, blSentence :: String -- raw HTML of the sentence containing the link
, blParagraph :: String -- raw HTML of the full paragraph (hover popup)
, blFragment :: String -- archived-target fragment (no '#'), else ""
} deriving (Show, Eq, Ord)
instance Aeson.ToJSON BacklinkSource where
@ -94,16 +98,18 @@ instance Aeson.ToJSON BacklinkSource where
, "abstract" .= blAbstract bl
, "sentence" .= blSentence bl
, "paragraph" .= blParagraph bl
, "fragment" .= blFragment bl
]
instance Aeson.FromJSON BacklinkSource where
parseJSON = Aeson.withObject "BacklinkSource" $ \o ->
BacklinkSource
<$> o Aeson..: "url"
<*> o Aeson..: "title"
<*> o Aeson..: "abstract"
<*> o Aeson..: "sentence"
<*> o Aeson..: "paragraph"
<$> o Aeson..: "url"
<*> o Aeson..: "title"
<*> o Aeson..: "abstract"
<*> o Aeson..: "sentence"
<*> o Aeson..: "paragraph"
<*> o Aeson..:? "fragment" Aeson..!= ""
-- ---------------------------------------------------------------------------
-- Writer options for context rendering
@ -125,15 +131,22 @@ contextWriterOpts = writerOpts
-- | URL filter: skip external links, pseudo-schemes, anchor-only fragments,
-- and static-asset paths.
isPageLink :: T.Text -> Bool
isPageLink u =
not (T.isPrefixOf "http://" u) &&
not (T.isPrefixOf "https://" u) &&
not (T.isPrefixOf "#" u) &&
not (T.isPrefixOf "mailto:" u) &&
not (T.isPrefixOf "tel:" u) &&
not (T.null u) &&
not (hasStaticExt u)
isPageLink u
-- An archived external URL is kept regardless of scheme or extension:
-- pass 2 inverts it to its /archive/<slug>/ page.
| isArchived = True
| otherwise =
not (T.isPrefixOf "http://" u) &&
not (T.isPrefixOf "https://" u) &&
not (T.isPrefixOf "#" u) &&
not (T.isPrefixOf "mailto:" u) &&
not (T.isPrefixOf "tel:" u) &&
not (T.null u) &&
not (hasStaticExt u)
where
isArchived = case archiveSlugFor u of
Just _ -> True
Nothing -> False
staticExts = [".pdf",".svg",".png",".jpg",".jpeg",".webp",
".mp3",".mp4",".woff2",".woff",".ttf",".ico",
".json",".asc",".xml",".gz",".zip"]
@ -289,6 +302,28 @@ percentDecode = T.unpack . TE.decodeUtf8With lenientDecode . pack . go
pack = BS.pack
lenientDecode = TE.lenientDecode
-- ---------------------------------------------------------------------------
-- Archive-aware target keying
-- ---------------------------------------------------------------------------
-- | The @data/backlinks.json@ key an outbound URL inverts to. An archived
-- external URL canonicalises to its @/archive/<slug>/@ page key — computed
-- exactly as 'backlinksFieldWith' computes the archive page's own key (the
-- same string fed through 'normaliseUrl'), so the two always agree. Every
-- other URL is normalised as before.
targetKey :: T.Text -> T.Text
targetKey u = case archiveSlugFor u of
Just slug -> T.pack (normaliseUrl ("/archive/" ++ slug ++ "/index.html"))
Nothing -> T.pack (normaliseUrl (T.unpack u))
-- | The fragment (without @#@) of an archived URL, for granular grouping
-- of "Referenced by". Empty for a non-archived URL or one with no fragment
-- — so granular grouping stays an archive-only behaviour.
archiveFragment :: T.Text -> String
archiveFragment u = case archiveSlugFor u of
Just _ -> T.unpack (T.drop 1 (T.dropWhile (/= '#') u))
Nothing -> ""
-- ---------------------------------------------------------------------------
-- Content patterns (must match the rules in Site.hs — sourced from
-- Patterns.allContent so additions to the canonical list automatically
@ -337,10 +372,11 @@ toSourcePairs item = do
:: Maybe [LinkEntry] of
Nothing -> return []
Just entries ->
return [ ( T.pack (normaliseUrl (T.unpack (leUrl e)))
return [ ( targetKey (leUrl e)
, BacklinkSource srcUrl title abstract
(leSentence e)
(leParagraph e)
(archiveFragment (leUrl e))
)
| e <- entries ]
@ -352,7 +388,20 @@ toSourcePairs item = do
-- to the current page, each with its paragraph context.
-- Returns @noResult@ (so @$if(backlinks)$@ is false) when there are none.
backlinksField :: Context String
backlinksField = field "backlinks" $ \item -> do
backlinksField = backlinksFieldWith renderBacklinks "backlinks"
-- | "Referenced by" for archive pages. Same lookup as 'backlinksField',
-- but the sources are grouped by the fragment each citation targets, so an
-- archived work's page can show which section/page each citing essay points
-- at (granular backlinks).
referencedByField :: Context String
referencedByField = backlinksFieldWith renderReferencedBy "referenced-by"
-- | Shared machinery for 'backlinksField' and 'referencedByField': look the
-- page up in @data/backlinks.json@ by its normalised route, then hand the
-- sorted sources to the given renderer.
backlinksFieldWith :: ([BacklinkSource] -> String) -> String -> Context String
backlinksFieldWith renderSources name = field name $ \item -> do
blItem <- load (fromFilePath "data/backlinks.json") :: Compiler (Item String)
case Aeson.decodeStrict (TE.encodeUtf8 (T.pack (itemBody blItem)))
:: Maybe (Map T.Text [BacklinkSource]) of
@ -367,7 +416,7 @@ backlinksField = field "backlinks" $ \item -> do
sorted = sortBy (comparing blTitle) sources
in if null sorted
then fail "no backlinks"
else return (renderBacklinks sorted)
else return (renderSources sorted)
-- ---------------------------------------------------------------------------
-- HTML rendering
@ -384,25 +433,59 @@ backlinksField = field "backlinks" $ \item -> do
renderBacklinks :: [BacklinkSource] -> String
renderBacklinks sources =
"<ul class=\"backlinks-list\">\n"
++ concatMap renderOne sources
++ concatMap renderBacklinkItem sources
++ "</ul>"
where
renderOne bl =
"<li class=\"backlink-item\">"
++ "<a class=\"backlink-source\" href=\""
++ escapeHtml (blUrl bl) ++ "\">"
++ escapeHtml (blTitle bl) ++ "</a>"
++ ( if null (blSentence bl) then ""
else "<blockquote class=\"backlink-quote\">"
++ blSentence bl
++ paragraphAffordance bl
++ "</blockquote>" )
++ "</li>\n"
paragraphAffordance bl
| null (blParagraph bl) = ""
| blParagraph bl == blSentence bl = ""
| otherwise =
-- | "Referenced by", grouped by the fragment each citation targets.
-- Sources citing the work with no fragment render first as a plain list;
-- each distinct fragment then gets its own subheading. With no fragments
-- anywhere (the common case) this collapses to exactly the flat list.
renderReferencedBy :: [BacklinkSource] -> String
renderReferencedBy sources =
let (general, fragmented) = partition (null . blFragment) sources
groups = Map.toList $ Map.fromListWith (flip (++))
[ (blFragment s, [s]) | s <- fragmented ]
in renderList general ++ concatMap renderGroup groups
where
renderList [] = ""
renderList ss = "<ul class=\"backlinks-list\">\n"
++ concatMap renderBacklinkItem ss ++ "</ul>\n"
renderGroup (frag, ss) =
"<div class=\"referenced-by-group\">"
++ "<h3 class=\"referenced-by-fragment\">"
++ escapeHtml (fragmentLabel frag) ++ "</h3>"
++ renderList ss
++ "</div>\n"
-- | Human label for a cited fragment: a PDF @#page=N@ becomes "Page N";
-- any other @#anchor@ is shown verbatim behind a section mark.
fragmentLabel :: String -> String
fragmentLabel frag =
case stripPrefix "page=" frag of
Just n -> "Page " ++ n
Nothing -> "\x00A7 " ++ frag
-- | One backlink @<li>@: the source title as a link, the sentence of
-- context as a blockquote, and a hover affordance revealing the full
-- paragraph. 'blSentence' / 'blParagraph' are already HTML fragments from
-- the Pandoc writer, so they are emitted unescaped.
renderBacklinkItem :: BacklinkSource -> String
renderBacklinkItem bl =
"<li class=\"backlink-item\">"
++ "<a class=\"backlink-source\" href=\""
++ escapeHtml (blUrl bl) ++ "\">"
++ escapeHtml (blTitle bl) ++ "</a>"
++ ( if null (blSentence bl) then ""
else "<blockquote class=\"backlink-quote\">"
++ blSentence bl
++ paragraphAffordance
++ "</blockquote>" )
++ "</li>\n"
where
paragraphAffordance
| null (blParagraph bl) = ""
| blParagraph bl == blSentence bl = ""
| otherwise =
"<span class=\"backlink-full\">"
++ "<button type=\"button\" class=\"backlink-full-trigger\""
++ " aria-label=\"Show full paragraph\" tabindex=\"0\">\x00B6</button>"
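
Reviewer sketch (not part of the patch): the grouping step in `renderReferencedBy`, reduced to a toy citation type so the ordering behaviour can be checked in isolation. `Cite` and the sample data are hypothetical stand-ins for `BacklinkSource`; the partition and `Map.fromListWith (flip (++))` calls mirror the ones in the hunk above.

```haskell
module Main (main) where

import           Data.List       (partition, stripPrefix)
import qualified Data.Map.Strict as Map

-- Hypothetical stand-in for BacklinkSource: (source title, cited fragment).
type Cite = (String, String)

-- Split off fragment-less citations, then bucket the rest by fragment.
-- 'fromListWith f' calls f with the new value first, so 'flip (++)'
-- appends each new citation after those already collected and every
-- bucket keeps the input order.
groupByFragment :: [Cite] -> ([Cite], [(String, [Cite])])
groupByFragment cites =
  let (general, fragmented) = partition (null . snd) cites
      groups = Map.toList $ Map.fromListWith (flip (++))
                 [ (snd c, [c]) | c <- fragmented ]
  in (general, groups)

-- Same labelling rule as the patch: "page=N" becomes "Page N",
-- anything else is shown behind a section mark (U+00A7).
fragmentLabel :: String -> String
fragmentLabel frag = case stripPrefix "page=" frag of
  Just n  -> "Page " ++ n
  Nothing -> "\167 " ++ frag

main :: IO ()
main = do
  let (general, groups) = groupByFragment
        [ ("Essay A", ""), ("Essay B", "page=12")
        , ("Essay C", "sec-3"), ("Essay D", "page=12") ]
  print (map fst general)
  mapM_ (\(f, cs) -> print (fragmentLabel f, map fst cs)) groups
```

One consequence worth knowing: `Map.toList` orders the groups by the raw fragment string, so `page=10` sorts before `page=2`. If page groups should appear in numeric order, that would need a custom sort key; the patch as shown does not do this.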

@ -13,6 +13,7 @@ import qualified Filters.Typography as Typography
import qualified Filters.Links as Links
import qualified Filters.SourceRefs as SourceRefs
import qualified Filters.Smallcaps as Smallcaps
import qualified Filters.Archive as Archive
import qualified Filters.Dropcaps as Dropcaps
import qualified Filters.Math as Math
import qualified Filters.Wikilinks as Wikilinks
@ -40,6 +41,7 @@ applyAll srcDir doc = do
. Sidenotes.apply
. Typography.apply
. Links.apply
. Archive.apply
. Smallcaps.apply
. Dropcaps.apply
. Math.apply

build/Filters/Archive.hs (new file, 82 lines)

@ -0,0 +1,82 @@
{-# LANGUAGE GHC2021 #-}
{-# LANGUAGE OverloadedStrings #-}
-- | Filters.Archive — annotate (and, for dead links, redirect) body links
-- to archived works.
--
-- For every @Link@ whose URL matches an entry in @data/archive-index.json@
-- (the equivalent-URL alias set included):
--
-- * a 'live', 'moved' or (inconclusive) 'error' target keeps its
-- original link and gains a small superscript affordance pointing at
-- the local @/archive/<slug>/@ page — purely additive;
--
-- * a 'rotted' target (confirmed dead by @archive.py check@'s
-- hysteresis) has its primary link flipped to the archived copy, so
-- a reader of an old essay reaches a working snapshot instead of a
-- 404. An "archived" marker replaces the affordance.
--
-- Registered in 'Filters.applyAll' immediately after @Smallcaps@ and
-- before @Links@: it must see the smallcaps-rewritten text, and it emits
-- the affordance/marker as @RawInline@ so the downstream @Links@ pass
-- never re-classifies it.
--
-- No-op when @data/archive-index.json@ is absent. When no rot scan has
-- run, every entry is 'Live' — no link is ever flipped.
module Filters.Archive (apply) where
import qualified Data.Text as T
import Text.Pandoc.Definition
import Text.Pandoc.Walk (walk)
import ArchiveIndex (ArchiveStatus (..), archiveIndexIsEmpty,
archiveSlugFor, archiveStatusForSlug)
-- | Annotate body links. Headings are left alone — an affordance there
-- would be noise. Identity when the index is empty.
apply :: Pandoc -> Pandoc
apply doc@(Pandoc meta blocks)
| archiveIndexIsEmpty = doc
| otherwise = Pandoc meta (map annotateBlock blocks)
annotateBlock :: Block -> Block
annotateBlock h@Header{} = h
annotateBlock b = walk annotateInlines b
-- | For each archived @Link@: flip it if the target is 'Rotted', else
-- append the affordance. Non-archived links pass through untouched.
annotateInlines :: [Inline] -> [Inline]
annotateInlines = concatMap expand
where
expand l@(Link attr text (url, _)) =
case archiveSlugFor url of
Nothing -> [l]
Just slug -> case archiveStatusForSlug slug of
Rotted -> [flipped slug attr text, marker slug "rotted"
"The original is a dead link &mdash; \
\opens the local archived copy"]
_ -> [l, marker slug "" "Archived &mdash; \
\local preservation copy"]
expand x = [x]
-- | A 'Rotted' link, redirected to the local archived copy. Keeps the
-- link text; the @archive-rotted@ class lets CSS mark it.
flipped :: String -> Attr -> [Inline] -> Inline
flipped slug (ident, classes, kvs) text =
Link (ident, "archive-rotted" : classes, kvs) text
( T.pack ("/archive/" ++ slug ++ "/")
, "Original link is dead \8212 opens the local archived copy" )
-- | The superscript marker after the link: "A" for a normal affordance,
-- "archived" for a flipped dead link. Emitted as raw HTML so the
-- downstream @Links@ filter (which classifies @Link@ nodes) leaves it
-- alone. Slugs are @[a-z0-9-]@ by construction in @archive.py@.
marker :: String -> String -> T.Text -> Inline
marker slug modifier title = RawInline "html" $ T.concat
[ "<sup class=\"archive-affordance", modifierClass, "\">"
, "<a href=\"/archive/", T.pack slug, "/\" title=\"", title, "\">"
, label, "</a></sup>"
]
where
modifierClass = if null modifier
then ""
else " archive-affordance--" <> T.pack modifier
label = if null modifier then "A" else "archived"
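
Reviewer sketch (not part of the patch): the flip-or-annotate decision in `annotateInlines`, modelled on toy types so it runs without pandoc-types. `Node`, `Status`, `index` and the example URLs are assumptions made for illustration; only the shape of the logic (one node in, one or two nodes out, spliced with `concatMap`) follows the filter above.

```haskell
module Main (main) where

import qualified Data.Map.Strict as Map

-- Toy stand-ins for ArchiveStatus and Pandoc's Inline (sketch only).
data Status = Live | Moved | Rotted deriving (Eq, Show)

data Node
  = Link String String    -- link text, URL
  | Marker String String  -- archive slug, visible label
  | Plain String
  deriving (Eq, Show)

-- Hypothetical index: URL -> (slug, status).
index :: Map.Map String (String, Status)
index = Map.fromList
  [ ("https://example.org/alive", ("alive-paper", Live))
  , ("https://example.org/gone",  ("gone-paper",  Rotted))
  ]

-- A rotted link is redirected to the local copy; any other archived link
-- keeps its original target and gains a marker; the rest pass through.
expand :: Node -> [Node]
expand l@(Link text url) = case Map.lookup url index of
  Nothing             -> [l]
  Just (slug, Rotted) -> [ Link text ("/archive/" ++ slug ++ "/")
                         , Marker slug "archived" ]
  Just (slug, _)      -> [l, Marker slug "A"]
expand n = [n]

annotate :: [Node] -> [Node]
annotate = concatMap expand

main :: IO ()
main = mapM_ print $ annotate
  [ Plain "See "
  , Link "this" "https://example.org/alive"
  , Link "that" "https://example.org/gone"
  , Link "other" "https://example.org/unlisted"
  ]
```

Working on the inline list rather than on single inlines is what lets the filter emit a sibling marker next to the link instead of nesting it inside the link text.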

@ -1,7 +1,23 @@
module Main where
import Hakyll (hakyll)
import Site (rules)
import Data.Time.Clock.POSIX (getPOSIXTime)
import System.Directory (createDirectoryIfMissing)
import Hakyll (hakyll)
import Site (rules)
-- | Stamp the start of this build into @data/build-stamp.txt@ before
-- Hakyll scans the provider directory. The file therefore always exists
-- and always differs from the previous run. The telemetry pages
-- (@/build/@, @/stats/@) @load@ it as a dependency so Hakyll recompiles
-- them on every build instead of serving a stale cached copy when no
-- tracked content changed. See build/Stats.hs and build/Site.hs.
writeBuildStamp :: IO ()
writeBuildStamp = do
createDirectoryIfMissing True "data"
t <- getPOSIXTime
writeFile "data/build-stamp.txt" (show t ++ "\n")
main :: IO ()
main = hakyll rules
main = do
writeBuildStamp
hakyll rules

@ -19,6 +19,7 @@ import qualified Data.Aeson as Aeson
import qualified Data.ByteString.Lazy.Char8 as LBS
import qualified Data.Map.Strict as Map
import Hakyll
import Archive (archiveRules)
import Authors (buildAllAuthors, applyAuthorRules)
import Backlinks (backlinkRules)
import BibExtras (BibExtra (..), emptyBibExtra, firstAuthorSurname, parseBibExtras)
@ -265,6 +266,13 @@ rules = do
-- /current.html. Re-compiles current.html when the YAML changes.
match "data/now.yaml" $ compile getResourceBody
-- Per-build stamp — written by Main.main before Hakyll starts, so it
-- always exists and always differs from the previous run. Matched
-- (not routed) purely so the telemetry pages can `load` it as a
-- dependency and thus recompile every build instead of serving a
-- stale cached copy. See build/Stats.hs.
match "data/build-stamp.txt" $ compile getResourceBody
-- ---------------------------------------------------------------------------
-- Homepage
-- ---------------------------------------------------------------------------
@ -529,6 +537,13 @@ rules = do
-- ---------------------------------------------------------------------------
photographyRules
-- ---------------------------------------------------------------------------
-- Archive — link-archiving system: per-entry /archive/<slug>/ pages and
-- the /archive/ index, driven by archive/manifest.yaml + PROVENANCE.json.
-- See build/Archive.hs and ARCHIVE.md for the design.
-- ---------------------------------------------------------------------------
archiveRules
-- ---------------------------------------------------------------------------
-- Blog index (paginated)
-- ---------------------------------------------------------------------------
@ -926,6 +941,13 @@ rules = do
create ["robots.txt"] $ do
route idRoute
compile $ makeItem $ unlines
-- /archive/ is *deliberately not* disallowed. Crawlers must be
-- able to reach the wrapper pages (and snapshot.html) to see
-- their <meta name=robots content="noindex, noarchive">; a
-- robots.txt Disallow would block that and a URL blocked only
-- by robots.txt can still appear in results when linked. The
-- raw PDFs cannot carry meta — they need an `X-Robots-Tag`
-- HTTP header from the deploy webserver (see nginx/archive.conf).
[ "User-agent: *"
, "Allow: /"
, ""

@ -37,6 +37,7 @@ import qualified Text.Blaze.Html5.Attributes as A
import Text.Blaze.Html.Renderer.String (renderHtml)
import qualified Text.Blaze.Internal as BI
import Hakyll
import Archive (archiveBuildStats)
import Contexts (siteCtx, authorLinksField)
import qualified Patterns as P
import Utils (readingTime)
@ -707,6 +708,14 @@ renderBuild ts dur =
, ("Last build duration", txt dur)
]
-- | Link-archive coverage and health. The metric rows are computed by
-- 'Archive.archiveBuildStats' (count, size, link-rot status breakdown,
-- snapshot quality, visibility, orphans); this only lays them out.
renderArchive :: [(String, String)] -> H.Html
renderArchive metrics =
section "archive" "Link archive" $
dl [ (k, txt v) | (k, v) <- metrics ]
-- ---------------------------------------------------------------------------
-- Static TOC (matches the nine h2 sections above)
-- ---------------------------------------------------------------------------
@ -726,6 +735,7 @@ pageTOC = H.ol $ mapM_ item sections
, ("links", "Links")
, ("epistemic", "Epistemic coverage")
, ("output", "Output")
, ("archive", "Link archive")
, ("repository", "Repository")
, ("build", "Build")
]
@ -743,6 +753,16 @@ statsRules tags = do
create ["build/index.html"] $ do
route idRoute
compile $ do
-- ----------------------------------------------------------------
-- Per-build stamp dependency: data/build-stamp.txt is rewritten
-- by Main.main on every invocation, so loading it here forces
-- Hakyll to recompile this page each build. Without it the page
-- is served from cache whenever no tracked content changed, and
-- every unsafeCompiler-sourced figure below (timestamp, output
-- stats, git, LOC) goes stale. The value itself is unused.
-- ----------------------------------------------------------------
_ <- load (fromFilePath "data/build-stamp.txt") :: Compiler (Item String)
-- ----------------------------------------------------------------
-- Load all content items
-- ----------------------------------------------------------------
@ -846,6 +866,11 @@ statsRules tags = do
(hf, hl, cf, cl, jf, jl) <- unsafeCompiler getLocStats
(commits, firstDate) <- unsafeCompiler getGitStats
-- ----------------------------------------------------------------
-- Link-archive coverage + link-rot health
-- ----------------------------------------------------------------
archiveMetrics <- unsafeCompiler archiveBuildStats
-- ----------------------------------------------------------------
-- Build timestamp + last build duration
-- ----------------------------------------------------------------
@ -869,6 +894,7 @@ statsRules tags = do
renderLinks mostLinkedInfo orphanCount (length allPIs)
renderEpistemic epTotal withStatus withConf withImp withEv
renderOutput outputGrouped totalFiles totalSize
renderArchive archiveMetrics
renderRepository hf hl cf cl jf jl commits firstDate
renderBuild buildTimestamp lastBuildDur
contentString = renderHtml htmlContent
@ -897,6 +923,11 @@ statsRules tags = do
create ["stats/index.html"] $ do
route idRoute
compile $ do
-- Per-build stamp dependency — forces a recompile every build
-- so the heatmap's "today" and all corpus figures stay current.
-- See the /build/ rule above for the full rationale.
_ <- load (fromFilePath "data/build-stamp.txt") :: Compiler (Item String)
essays <- loadAll (P.essayPattern .&&. hasNoVersion)
posts <- loadAll ("content/blog/*.md" .&&. hasNoVersion)
poems <- loadAll ("content/poetry/*.md" .&&. hasNoVersion)

@ -13,6 +13,8 @@ executable site
hs-source-dirs: build
other-modules:
Site
Archive
ArchiveIndex
Authors
Catalog
Commonplace
@ -36,6 +38,7 @@ executable site
Filters.Sidenotes
Filters.Dropcaps
Filters.Smallcaps
Filters.Archive
Filters.Wikilinks
Filters.Transclusion
Filters.EmbedPdf

nginx/archive.conf (new file, 45 lines)

@ -0,0 +1,45 @@
# archive.conf — `X-Robots-Tag: noindex, noarchive` for the link archive.
#
# Place at /etc/nginx/snippets/archive.conf and `include` it inside the
# levineuwirth.org server { } block, *after* security-headers.conf:
#
# server {
# server_name levineuwirth.org;
# root /var/www/levineuwirth.org;
# ...
# include snippets/security-headers.conf;
# include snippets/static-assets.conf;
# include snippets/popup-proxy.conf;
# include snippets/archive.conf;
# }
#
# Why a response header rather than robots.txt: a URL blocked by
# robots.txt can still appear in results when externally linked, and the
# noindex directive must be reachable. Wrapper pages carry the meta in
# HTML, and the HTML snapshots have the same meta injected at fetch
# time. But raw PDFs cannot carry meta directives — and a robots.txt
# Disallow on /archive/ would prevent crawlers from reading the wrapper
# meta in the first place. The header form is the right control for the
# whole tree: crawlers honour it for any resource, HTML or PDF.
#
# `^~` makes this prefix-match take priority over any regex location
# that might match the same path.
location ^~ /archive/ {
# nginx's add_header chain is inherited from a parent context ONLY
# when the current context declares no add_header directives — see
# nginx.org/en/docs/http/ngx_http_headers_module.html. Adding any
# header inside this location would silently drop the baseline
# security headers within the /archive/ subtree, so we re-include
# security-headers.conf to keep HSTS, CSP, X-Frame-Options, etc.
# intact for archive pages and raw artifacts.
include snippets/security-headers.conf;
# `always` so the header is emitted even on 4xx/5xx responses (by
# default add_header applies only to a fixed set of 2xx/3xx status
# codes; without `always` a 404 under /archive/ would go out without it).
add_header X-Robots-Tag "noindex, noarchive" always;
# Hand off to the same static-file fallback as the rest of the site.
try_files $uri $uri/index.html $uri.html =404;
}

@ -42,6 +42,12 @@ server {
include snippets/security-headers.conf;
include snippets/static-assets.conf;
include snippets/popup-proxy.conf;
# archive.conf must come *after* security-headers.conf — it declares
# its own add_header inside `location ^~ /archive/`, which (per the
# nginx add_header inheritance rules) would otherwise drop the
# baseline headers within that subtree. The snippet re-includes
# security-headers.conf inside its location to compensate.
include snippets/archive.conf;
# Static-site fallback. Pretty URLs first (foo/index.html, foo.html),
# then 404.

static/css/archive.css (new file, 463 lines)

@ -0,0 +1,463 @@
/* archive.css: the link archive at /archive/ and /archive/<slug>/.
*
* Gated in head.html via $if(archive)$ (build/Archive.hs sets the flag on
* the index and every entry page). The archive pages are structured
* surfaces rather than prose, but they render inside #markdownBody so
* every rule here is scoped under #markdownBody to clear the id-specificity
* prose rules in typography.css (heading scales, figure framing, paragraph
* indent) that would otherwise win over a bare class.
*
* Treatment: "framed / structured". The archival chrome (banner,
* provenance panel, the embedded artifact viewer) is given visible borders
* so a reader is never in doubt that this is a preservation copy, not the
* original. All colour comes from tokens, so dark mode follows for free;
* the embedded artifact itself is shown raw and is deliberately not themed.
*/
/* Structured pages, not essays — no first-line indent on any paragraph. */
#markdownBody :is(.archive-banner-text, .archive-degraded, .archive-note,
.archive-private, .archive-status-note, .archive-index-intro,
.archive-removal, .archive-empty),
#markdownBody .archive-fulltext-wrap > p {
text-indent: 0;
}
/* ============================================================
ENTRY HEADER + ARCHIVAL BANNER
The banner is a bordered callout, stacked: a small-caps label,
one plain-language line, and the original link given real
weight: the original is the hero, never the archived copy.
============================================================ */
#markdownBody .archive-header {
margin-bottom: 0.5rem;
}
#markdownBody .archive-header .page-title {
margin-bottom: 0;
}
#markdownBody .archive-banner {
margin-top: 1.4rem;
padding: 0.9rem 1.1rem;
display: flex;
flex-direction: column;
gap: 0.3rem;
border: 1px solid var(--border-muted);
border-radius: 2px;
background: var(--bg-subtle);
}
#markdownBody .archive-banner-label {
margin: 0;
font-family: var(--font-sans);
font-size: 0.7rem;
font-variant: all-small-caps;
font-feature-settings: "smcp" 1;
letter-spacing: 0.13em;
color: var(--text-muted);
}
#markdownBody .archive-banner-text {
margin: 0;
font-family: var(--font-serif);
font-size: 0.95rem;
line-height: 1.5;
color: var(--text);
}
#markdownBody .archive-banner-original {
align-self: flex-start;
font-family: var(--font-sans);
font-size: 0.85rem;
font-weight: 600;
}
/* Degraded / js-required snapshots: a dashed-border note. Restrained:
the monochrome palette has no alarm colour and wants none. */
#markdownBody .archive-degraded {
margin: 1rem 0 0;
padding: 0.7rem 1rem;
border: 1px dashed var(--border-muted);
border-radius: 2px;
font-family: var(--font-serif);
font-size: 0.9rem;
line-height: 1.55;
color: var(--text-muted);
}
#markdownBody .archive-degraded-label {
margin-right: 0.4rem;
font-family: var(--font-sans);
font-size: 0.7rem;
font-variant: all-small-caps;
font-feature-settings: "smcp" 1;
letter-spacing: 0.1em;
color: var(--text);
}
/* Private entry: the artifact is held offline, not published; a calm
informational panel in place of the artifact viewer. */
#markdownBody .archive-private {
margin: 1.8rem 0;
padding: 1rem 1.2rem;
border: 1px solid var(--border);
border-radius: 2px;
background: var(--bg-subtle);
font-family: var(--font-serif);
font-size: 0.95rem;
line-height: 1.6;
color: var(--text-muted);
}
/* Link-rot status: a header note for non-live states (archive.py check),
and the status word in the provenance panel. The palette is monochrome,
so a `rotted` entry is marked by weight and a heavier left rule, never
colour. */
#markdownBody .archive-status-note {
margin: 1rem 0 0;
padding: 0.7rem 1rem;
border: 1px solid var(--border-muted);
border-left-width: 3px;
border-radius: 2px;
font-family: var(--font-serif);
font-size: 0.92rem;
line-height: 1.55;
color: var(--text);
}
#markdownBody .archive-status-note--rotted {
border-left-color: var(--text);
}
#markdownBody .archive-status-note--moved {
color: var(--text-muted);
}
#markdownBody .archive-status {
font-variant: all-small-caps;
font-feature-settings: "smcp" 1;
letter-spacing: 0.04em;
}
#markdownBody .archive-status--live {
color: var(--text-muted);
}
#markdownBody .archive-status--rotted {
font-weight: 600;
}
/* ============================================================
PROVENANCE PANEL
A bordered box with a small-caps label; the metadata is a
two-column key/value grid: labels auto-sized, values take
the rest, long URLs and hashes wrap rather than overflow.
============================================================ */
#markdownBody .archive-provenance {
margin: 1.8rem 0;
padding: 1rem 1.2rem 1.1rem;
border: 1px solid var(--border);
border-radius: 2px;
}
#markdownBody .archive-panel-title {
margin: 0 0 0.7rem;
font-family: var(--font-sans);
font-size: 0.72rem;
font-weight: 600;
font-variant: all-small-caps;
font-feature-settings: "smcp" 1;
letter-spacing: 0.12em;
color: var(--text-faint);
}
#markdownBody .archive-meta {
margin: 0;
display: grid;
grid-template-columns: max-content 1fr;
gap: 0.34rem 1.1rem;
}
#markdownBody .archive-meta dt {
font-family: var(--font-sans);
font-size: 0.78rem;
font-variant: all-small-caps;
font-feature-settings: "smcp" 1;
letter-spacing: 0.05em;
color: var(--text-faint);
}
#markdownBody .archive-meta dd {
margin: 0;
font-family: var(--font-serif);
font-size: 0.92rem;
color: var(--text);
overflow-wrap: anywhere;
}
#markdownBody .archive-meta dd code {
font-family: var(--font-mono);
font-size: 0.82rem;
}
/* The author's reason-for-archiving note, set in the page measure. */
#markdownBody .archive-note {
margin: 1.6rem 0;
font-family: var(--font-serif);
font-size: 0.97rem;
font-style: italic;
line-height: 1.6;
color: var(--text-muted);
}
/* ============================================================
ARTIFACT VIEWER
A <div> (not a <figure> that carries prose framing) with a
mono caption bar that names the raw artifact and links to it,
and the artifact embedded raw beneath: the PDF renders in the
browser's native viewer, the HTML snapshot loads sandboxed.
============================================================ */
#markdownBody .archive-viewer {
margin: 1.8rem 0;
border: 1px solid var(--border-muted);
border-radius: 2px;
overflow: hidden;
box-shadow: 0 4px 12px rgba(0, 0, 0, 0.03);
}
#markdownBody .archive-viewer-bar {
display: flex;
align-items: baseline;
justify-content: space-between;
gap: 1rem;
padding: 0.45rem 0.75rem;
border-bottom: 1px solid var(--border-muted);
background: var(--bg-subtle);
}
#markdownBody .archive-viewer-name {
font-family: var(--font-mono);
font-size: 0.78rem;
color: var(--text-muted);
}
#markdownBody .archive-viewer-open {
font-family: var(--font-sans);
font-size: 0.76rem;
white-space: nowrap;
}
#markdownBody .archive-frame {
display: block;
width: 100%;
height: 80vh;
border: 0;
background: var(--bg);
}
/* ============================================================
EXTRACTED FULL TEXT
Always in the DOM, for embed.py / Pagefind. PDF text is
collapsed in a <details> and keeps its pdftotext layout in a
scrollable mono block; HTML text shows as serif paragraphs.
============================================================ */
#markdownBody .archive-fulltext-wrap {
margin: 1.8rem 0 0;
}
#markdownBody .archive-fulltext-title,
#markdownBody .archive-section-title {
margin: 0 0 0.6rem;
padding-bottom: 0.4rem;
border-bottom: 1px solid var(--border);
font-family: var(--font-sans);
font-size: 0.78rem;
font-weight: 600;
font-variant: all-small-caps;
font-feature-settings: "smcp" 1;
letter-spacing: 0.1em;
color: var(--text-muted);
}
#markdownBody summary.archive-fulltext-title {
cursor: pointer;
}
#markdownBody .archive-fulltext-wrap > p {
margin: 0 0 0.85rem;
font-family: var(--font-serif);
font-size: 0.95rem;
line-height: 1.6;
color: var(--text);
}
/* The pdftotext block: scroll-capped so it never dominates the page. */
#markdownBody .archive-fulltext {
margin: 0.8rem 0 0;
padding: 0.9rem 1rem;
max-height: 60vh;
overflow: auto;
border: 1px solid var(--border);
border-radius: 2px;
background: var(--bg-subtle);
font-family: var(--font-mono);
font-size: 0.8rem;
line-height: 1.5;
color: var(--text-muted);
white-space: pre-wrap;
overflow-wrap: anywhere;
}
/* ============================================================
REFERENCED BY / RELATED
The site-wide .backlinks-list / .similar-links-list styles
(components.css) carry the lists themselves; these rules add
only the section framing and the granular fragment groups.
============================================================ */
#markdownBody .archive-backlinks,
#markdownBody .archive-related {
margin: 1.8rem 0 0;
}
#markdownBody .referenced-by-group {
margin-top: 0.9rem;
}
#markdownBody .referenced-by-fragment {
margin: 0 0 0.3rem;
font-family: var(--font-sans);
font-size: 0.72rem;
font-weight: 600;
font-variant: all-small-caps;
font-feature-settings: "smcp" 1;
letter-spacing: 0.08em;
color: var(--text-faint);
}
/* ============================================================
REMOVAL NOTICE
A quiet italic footer line, set off by a top rule; present
on every archive page and on the index.
============================================================ */
#markdownBody .archive-removal {
margin: 2.4rem 0 0;
padding-top: 1rem;
border-top: 1px solid var(--border);
font-family: var(--font-serif);
font-size: 0.85rem;
font-style: italic;
line-height: 1.55;
color: var(--text-faint);
}
/* ============================================================
INDEX PAGE /archive/
A text list in the catalog idiom: one hairline between rows,
the title in serif, type + date + any quality flag in quiet
sans pushed to the row's end.
============================================================ */
#markdownBody .archive-index-header {
margin-bottom: 1.8rem;
}
#markdownBody .archive-index-intro {
margin: 0.6rem 0 0;
font-family: var(--font-serif);
font-size: 1rem;
line-height: 1.6;
color: var(--text-muted);
}
#markdownBody .archive-list {
margin: 0;
padding: 0;
list-style: none;
}
#markdownBody .archive-list-item {
display: flex;
align-items: baseline;
justify-content: space-between;
gap: 0.4rem 1rem;
flex-wrap: wrap;
padding: 0.7rem 0;
border-bottom: 1px solid var(--border);
}
#markdownBody .archive-list-item:last-child {
border-bottom: none;
}
#markdownBody .archive-list-link {
font-family: var(--font-serif);
font-size: 1.05rem;
color: var(--text);
text-decoration: none;
}
#markdownBody .archive-list-link:hover {
text-decoration: underline;
text-underline-offset: 2px;
}
#markdownBody .archive-list-meta {
font-family: var(--font-sans);
font-size: 0.78rem;
color: var(--text-faint);
white-space: nowrap;
}
/* Non-'ok' capture flag — a dashed chip, echoing the entry-page note. */
#markdownBody .archive-quality-flag {
padding: 0.05em 0.4em;
border: 1px dashed var(--border-muted);
border-radius: 2px;
font-variant: all-small-caps;
font-feature-settings: "smcp" 1;
letter-spacing: 0.04em;
color: var(--text-muted);
}
/* A rotted entry is the one health state worth a solid, inked flag. */
#markdownBody .archive-quality-flag--rotted {
border-style: solid;
border-color: var(--text);
color: var(--text);
}
#markdownBody .archive-empty {
font-family: var(--font-serif);
font-style: italic;
color: var(--text-muted);
}
/* ============================================================
MOBILE
Collapse the provenance grid to stacked rows; trim the frame.
============================================================ */
@media (max-width: 540px) {
#markdownBody .archive-meta {
grid-template-columns: 1fr;
gap: 0;
}
#markdownBody .archive-meta dt {
margin-top: 0.55rem;
}
#markdownBody .archive-meta dt:first-of-type {
margin-top: 0;
}
#markdownBody .archive-frame {
height: 70vh;
}
}

@@ -1849,3 +1849,50 @@ pre:hover .copy-btn,
min-height: 300px;
}
}
/* Archive affordance
The superscript "A" appended after a body link whose target is preserved
in the local archive (build/Filters/Archive.hs). Loaded site-wide because
the marker appears in essay/prose content, not on archive pages. */
.archive-affordance {
font-size: 0.7em;
margin-left: 0.15em;
line-height: 0;
}
.archive-affordance a {
font-family: var(--font-sans);
font-weight: 600;
text-decoration: none;
color: var(--text-faint);
border: 1px solid var(--border-muted);
border-radius: 2px;
padding: 0 0.25em;
}
.archive-affordance a:hover {
color: var(--text);
border-color: var(--text-muted);
background: var(--bg-subtle);
}
/* Dead-link flip: a body link whose archived target is `rotted` has its
href redirected to the local copy (build/Filters/Archive.hs). A dotted
underline marks the link as redirected; its marker becomes a solid chip
reading "archived" rather than the quiet bordered "A". */
.archive-rotted {
text-decoration-style: dotted;
}
.archive-affordance--rotted a {
color: var(--bg);
background: var(--text-muted);
border-color: var(--text-muted);
}
.archive-affordance--rotted a:hover {
color: var(--bg);
background: var(--text);
border-color: var(--text);
}
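
The behaviour described by the two comment blocks above (a quiet "A" marker on live citations, and a flip to the local copy once the target is `rotted`) can be sketched as follows. This is only an illustration in Python: the real logic lives in the Haskell Pandoc filter build/Filters/Archive.hs, which is not shown in this diff. `ArchiveEntry`, `decorate_link`, the shape of the returned dictionary, and the assumption that the rotted marker carries both the base and the modifier class are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchiveEntry:
    """Hypothetical record for one archived work."""
    slug: str    # names the /archive/<slug>/ page
    status: str  # "rotted" triggers the flip; anything else is treated as live


def decorate_link(href: str, archive: dict[str, ArchiveEntry]) -> dict:
    """Decide how a body link should be rendered.

    Returns the final href, an optional class for the link itself, and an
    optional marker (text, target, class) to append after the link.
    """
    entry = archive.get(href)
    if entry is None:
        # Not archived: the link is left untouched and gets no marker.
        return {"href": href, "link_class": None, "marker": None}

    local = f"/archive/{entry.slug}/"
    if entry.status == "rotted":
        # Dead-link flip: point the link at the local copy, mark it with the
        # dotted underline class, and use the solid "archived" chip.
        return {
            "href": local,
            "link_class": "archive-rotted",
            "marker": {
                "text": "archived",
                "href": local,
                "class": "archive-affordance archive-affordance--rotted",
            },
        }

    # Live citation: keep the original href and append the quiet "A".
    return {
        "href": href,
        "link_class": None,
        "marker": {"text": "A", "href": local, "class": "archive-affordance"},
    }
```

In this sketch the original URL is only ever replaced in the rotted case, which matches the CSS split: `.archive-rotted` styles the link itself, while the `.archive-affordance` rules style only the appended marker.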

@@ -0,0 +1,23 @@
<div id="content">
<main id="markdownBody" data-pagefind-body>
<header class="archive-index-header">
<h1 class="page-title">$title$</h1>
<p class="archive-index-intro">Local snapshots of works referenced across the site, preserved against link rot. Each is an archived copy; the original is linked prominently from its page.</p>
</header>
$if(has-entries)$
<ul class="archive-list">
$for(entries)$
<li class="archive-list-item">
<a class="archive-list-link" href="$entry-url$">$entry-title$</a>
<span class="archive-list-meta">$entry-type$ &middot; archived $entry-archived$$if(entry-degraded)$ &middot; <span class="archive-quality-flag">$entry-quality$ capture</span>$endif$$if(entry-private)$ &middot; <span class="archive-quality-flag">private</span>$endif$$if(entry-rotted)$ &middot; <span class="archive-quality-flag archive-quality-flag--rotted">link rotted</span>$endif$</span>
</li>
$endfor$
</ul>
$else$
<p class="archive-empty">Nothing archived yet.</p>
$endif$
$partial("templates/partials/archive-removal-notice.html")$
</main>
</div>

109
templates/archive.html Normal file
@@ -0,0 +1,109 @@
<div id="content">
<main id="markdownBody" data-pagefind-body data-pagefind-filter="type:archive, status:$status$">
<article class="archive-entry">
<header class="archive-header">
<h1 class="page-title">$title$</h1>
$partial("templates/partials/archive-banner.html")$
$if(status-note)$
<p class="archive-status-note archive-status-note--$status$" role="note">
$status-note$
</p>
$endif$
$if(degraded)$
<p class="archive-degraded" role="note">
<span class="archive-degraded-label">Capture: $snapshot-quality$</span>
Some of the original's content (images, scripted elements)
may be missing or incomplete in this snapshot. The original
is linked above.
</p>
$endif$
</header>
<section class="archive-provenance" aria-label="Provenance">
<h2 class="archive-panel-title">Provenance</h2>
<dl class="archive-meta">
<dt>Original</dt>
<dd><a href="$original-url$" rel="noopener noreferrer" target="_blank">$original-url$</a></dd>
<dt>Link status</dt>
<dd class="archive-status archive-status--$status$">$status$</dd>
<dt>Archived</dt>
<dd>$archived$</dd>
<dt>Type</dt>
<dd>$archive-type$</dd>
<dt>Snapshot quality</dt>
<dd>$snapshot-quality$</dd>
<dt>Size</dt>
<dd>$size$</dd>
<dt>SHA-256</dt>
<dd><code>$sha-short$&hellip;</code></dd>
$if(wayback)$
<dt>Wayback</dt>
<dd><a href="$wayback$" rel="noopener noreferrer" target="_blank">web.archive.org copy</a></dd>
$endif$
$if(paywalled)$
<dt>Access</dt>
<dd>The original sits behind a paywall.</dd>
$endif$
$if(private)$
<dt>Visibility</dt>
<dd>private &mdash; held offline</dd>
$endif$
</dl>
</section>
$if(note)$<p class="archive-note">$note$</p>$endif$
$if(private)$
<p class="archive-private" role="note">
This work is archived <strong>privately</strong>: a local
preservation copy is kept against link rot, but the artifact
is not published here. Use the original link above to read it.
</p>
$else$
<div class="archive-viewer">
<div class="archive-viewer-bar">
<span class="archive-viewer-name">$artifact-name$</span>
<a class="archive-viewer-open" href="$artifact-url$" target="_blank" rel="noopener noreferrer">Open raw&nbsp;&#8599;</a>
</div>
$if(is-pdf)$
<iframe class="archive-frame" src="$artifact-url$" title="$title$ &mdash; archived document" loading="lazy"></iframe>
$endif$
$if(is-html)$
<iframe class="archive-frame" src="$artifact-url$" title="$title$ &mdash; archived snapshot" sandbox referrerpolicy="no-referrer" loading="lazy"></iframe>
$endif$
</div>
$endif$
$if(fulltext)$
$if(is-pdf)$
<details class="archive-fulltext-wrap">
<summary class="archive-fulltext-title">Full text (extracted)</summary>
$fulltext$
</details>
$endif$
$if(is-html)$
<section class="archive-fulltext-wrap">
<h2 class="archive-fulltext-title">Readable text (extracted)</h2>
$fulltext$
</section>
$endif$
$endif$
$if(referenced-by)$
<section class="archive-backlinks">
<h2 class="archive-section-title">Referenced by</h2>
$referenced-by$
</section>
$endif$
$if(similar-links)$
<section class="archive-related">
<h2 class="archive-section-title">Related</h2>
$similar-links$
</section>
$endif$
$partial("templates/partials/archive-removal-notice.html")$
</article>
</main>
</div>

@@ -0,0 +1,5 @@
<div class="archive-banner" role="note">
<p class="archive-banner-label">Archived copy</p>
<p class="archive-banner-text">A local preservation snapshot taken $archived$ &mdash; this page is not the original.</p>
<a class="archive-banner-original" href="$original-url$" rel="noopener noreferrer" target="_blank">View the original&nbsp;&#8599;</a>
</div>

@@ -0,0 +1,5 @@
<p class="archive-removal">
This is an archived copy, preserved so that a work cited across the site
survives the original going dark. To request removal, email
<a href="mailto:ln@levineuwirth.org">ln@levineuwirth.org</a>.
</p>

@@ -2,6 +2,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1">
$if(home)$<title>Levi Neuwirth</title>$else$$if(title)$<title>$title$ — Levi Neuwirth</title>$else$<title>Levi Neuwirth</title>$endif$$endif$
$if(description)$<meta name="description" content="$description$">$endif$
$if(noindex)$<meta name="robots" content="noindex">$endif$
<link rel="canonical" href="$site-url$$url$">
<link rel="alternate" type="application/atom+xml" title="Levi Neuwirth" href="/feed.xml">
<link rel="alternate" type="application/atom+xml" title="Levi Neuwirth — music" href="/music/feed.xml">
@@ -49,6 +50,7 @@ $if(build)$<link rel="stylesheet" href="/css/build.css">$endif$
$if(reading)$<link rel="stylesheet" href="/css/reading.css">$endif$
$if(composition)$<link rel="stylesheet" href="/css/score-reader.css">$endif$
$if(photography)$<link rel="stylesheet" href="/css/photography.css">$endif$
$if(archive)$<link rel="stylesheet" href="/css/archive.css">$endif$
$if(photography-map)$<link rel="stylesheet" href="/leaflet/leaflet.css">$endif$
$if(photography-map)$<link rel="stylesheet" href="/leaflet/MarkerCluster.css">$endif$
$if(photography-map)$<link rel="stylesheet" href="/leaflet/MarkerCluster.Default.css">$endif$

1151
tools/archive.py Normal file

File diff suppressed because it is too large.

BIN
tools/bin/monolith Executable file

Binary file not shown.

@@ -48,7 +48,16 @@ MIN_SCORE = 0.30 # similar-links: discard weak matches
 MIN_PARA_CHARS = 80 # semantic: skip very short paragraphs
 MAX_PARA_CHARS = 1000 # semantic: truncate before embedding
-EXCLUDE_URLS = {"/search/", "/build/", "/404.html", "/feed.xml", "/music/feed.xml"}
+# /archive/ is the archive index — a list page that would dominate every
+# entry's "Related" set; the individual /archive/<slug>/ pages stay in.
+EXCLUDE_URLS = {"/search/", "/build/", "/404.html", "/feed.xml",
+                "/music/feed.xml", "/archive/"}
+# Whole subtrees kept out of the corpus. /source/ is the repository code
+# mirror — source files, not content; left in, they pollute every page's
+# "Related" set and semantic search (e.g. a template file surfacing as a
+# neighbour, titled with its unrendered "$title$" placeholder).
+EXCLUDE_PREFIXES = ("/source/",)
 # Pages whose <body data-portal> are portal/landing pages — they aggregate
 # excerpts from many entries and would otherwise dominate every page's
@@ -122,7 +131,7 @@ def extract_page(html_path: Path) -> dict | None:
     soup = BeautifulSoup(raw, "html.parser")
     url = _url_from_path(html_path)
-    if url in EXCLUDE_URLS:
+    if url in EXCLUDE_URLS or url.startswith(EXCLUDE_PREFIXES):
         return None
     body_tag = soup.body
     if body_tag is not None and body_tag.has_attr(PORTAL_BODY_ATTR):
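
The combined exclusion test in the hunk above can be tried in isolation. The sketch below copies the two constants from the diff and wraps the condition in a helper; the name `is_excluded` and the sample paths are hypothetical and do not appear in the commit. It relies on `str.startswith` accepting a tuple of prefixes, which is why `EXCLUDE_PREFIXES` is a tuple and not a list.

```python
# Constants as they appear in the diff.
EXCLUDE_URLS = {"/search/", "/build/", "/404.html", "/feed.xml",
                "/music/feed.xml", "/archive/"}
EXCLUDE_PREFIXES = ("/source/",)


def is_excluded(url: str) -> bool:
    """Hypothetical helper mirroring the condition in extract_page.

    Exact matches drop single pages (including the /archive/ index);
    prefix matches drop whole subtrees such as /source/.
    """
    # str.startswith accepts a tuple of prefixes; passing a list raises TypeError.
    return url in EXCLUDE_URLS or url.startswith(EXCLUDE_PREFIXES)


if __name__ == "__main__":
    for sample in ("/archive/", "/archive/some-slug/", "/source/templates/archive.html"):
        print(sample, is_excluded(sample))
```

Because `/archive/` is matched exactly and not as a prefix, the index page is dropped while the individual `/archive/<slug>/` pages stay in the corpus, as the comment in the diff intends.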

@@ -0,0 +1,17 @@
# Pinned monolith binary — the HTML-snapshot tool for the link archive.
#
# Unlike PDF.js / Leaflet (servable assets downloaded at build time and
# gitignored), monolith is a build-time *executable*: the binary itself is
# committed at tools/bin/monolith so `git clone` -> `make build` needs no
# network fetch and stays reproducible from a bare clone. See ARCHIVE.md.
#
# To re-vendor (version bump, or a build host on a different architecture):
# 1. Download the matching asset from
# https://github.com/Y2Z/monolith/releases
# 2. Place it at tools/bin/monolith and `chmod +x`.
# 3. Update the three values below; verify `tools/bin/monolith --version`.
# 4. Commit the binary and this file together.
version = 2.10.1
asset = monolith-gnu-linux-x86_64
sha256 = 663ca914b078e91d5a854b4a07e913c613bbbcfe8fb11a24da1a6ab23c9205df
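
A pin file in this `key = value` shape lends itself to a simple integrity check on the vendored binary. The following is a minimal sketch, assuming the `sha256` value is the SHA-256 of the file at tools/bin/monolith; the function names are hypothetical, and the diff does not show whether or where the build performs such a check.

```python
import hashlib
from pathlib import Path


def parse_pin(text: str) -> dict[str, str]:
    """Parse `key = value` lines, skipping blanks and `#` comments."""
    pins: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an "="
            pins[key.strip()] = value.strip()
    return pins


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large binary is never read in one go."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def binary_matches_pin(binary: Path, pin_file: Path) -> bool:
    """True when the binary's SHA-256 equals the pinned `sha256` value."""
    pins = parse_pin(pin_file.read_text(encoding="utf-8"))
    return sha256_of(binary) == pins.get("sha256", "").lower()
```

A caller would pass the path of the committed binary and the path of this pin file, and refuse to run the snapshot step when the function returns False.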