Add link archive system: snapshots, backlinks, link-rot

Preserve external works the site cites against link rot, host them at permanent /archive/<slug>/ URLs in site chrome, and treat them as first-class citizens of the backlinks and similar-pages indexes. Curated, not crawled: the author adds one line to archive/manifest.yaml and the build fetches, hashes, snapshots, and indexes the work.

* archive/manifest.yaml + tools/archive.py (fetch / refresh / wayback / check / gc): PDFs downloaded directly, HTML pages snapshotted with a vendored monolith (tools/bin/monolith @ 2.10.1) into a single self-contained file with the archive CSP and a noarchive robots meta injected. Per-entry PROVENANCE.json committed; gitignored .txt sidecars regenerated from the artifact's SHA-256.
* build/Archive.hs + build/ArchiveIndex.hs + build/Filters/Archive.hs: Hakyll rules for /archive/ and /archive/<slug>/, a body Pandoc filter that appends an archive affordance to live citations and flips dead ones to the local copy on archive.py check's asymmetric hysteresis (rotted needs 3 fails over >= 14 days; one ok recovers).
* build/Backlinks.hs: keeps archived external URLs through pass 1 and canonicalises them to /archive/<slug>/ in pass 2, producing a "Referenced by" section grouped by the fragment each citation targets. build/Stats.hs gains a "Link archive" telemetry block on /build/ (count, total size, median age, by-status / by-quality / by-visibility, orphans).
* Integrity: archive.py fetch and build/Archive.hs (via sha256sum) both re-hash every committed artifact, so a tampered file halts the build even with cabal invoked directly or no .venv present. refresh refuses to replace an uncommitted prior snapshot and rolls back atomically on any exit path. removed.yaml is honoured by fetch, wayback, and check using canonical-form (tracking-stripped, arXiv-canonicalised) comparison.
* visibility: private keeps an entry in-repo but undeployed. nginx/archive.conf emits X-Robots-Tag: noindex, noarchive for raw artifacts that cannot carry meta directives.

The full design, phase plan (1-5), and three refinement passes live in ARCHIVE.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
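
The asymmetric hysteresis described in the commit message can be sketched roughly as follows. This is an illustration only, not the code in tools/archive.py: the `LinkState` record, the `record_probe` function, and the intermediate `"failing"` status are hypothetical names, and the sketch assumes the 14-day window is measured from the first failure of the current streak.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Thresholds taken from the commit message: 3 consecutive failures
# spread over at least 14 days.
FAILS_TO_ROT = 3
MIN_ROT_SPAN = timedelta(days=14)


@dataclass
class LinkState:
    """Hypothetical per-URL state; field names are assumptions."""
    status: str = "ok"                  # "ok", "failing" (assumed), or "rotted"
    consecutive_fails: int = 0
    first_fail: Optional[date] = None   # start of the current failure streak


def record_probe(state: LinkState, ok: bool, today: date) -> LinkState:
    """Fold one probe result into the state and return the new state."""
    if ok:
        # A single success recovers immediately and clears the streak.
        return LinkState()
    first = state.first_fail or today
    fails = state.consecutive_fails + 1
    # Both conditions must hold before a link is declared rotted.
    rotted = fails >= FAILS_TO_ROT and (today - first) >= MIN_ROT_SPAN
    return LinkState("rotted" if rotted else "failing", fails, first)
```

The asymmetry is deliberate in this reading: declaring a link dead rewrites citations, so it needs sustained evidence, while a recovery is cheap to act on and takes effect after one successful probe.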
parent 14c881b9e4
commit 77e31efdae
@@ -69,10 +69,20 @@ data/similar-links.json
data/backlinks.json
data/build-stats.json
data/build-start.txt
data/build-stamp.txt
data/last-build-seconds.txt
data/semantic-index.bin
data/semantic-meta.json

# Archive: generated text + its staleness stamp (recreated from the
# committed artifact on every build; deterministic, so committing them is
# churn). archive/**/PROVENANCE.json is deliberately NOT ignored: it is
# the committed, immutable record of each archival event.
archive/**/*.txt
archive/**/*.txt.sha256
data/archive-index.json
data/archive-state.json

# IGNORE.txt is for the local build and need not be synced.
IGNORE.txt

File diff suppressed because it is too large
44
Makefile
@@ -1,4 +1,4 @@
.PHONY: build deploy sign download-model download-pdfjs download-leaflet compress-assets convert-images pdf-thumbs pdfs watch clean dev
.PHONY: build deploy sign download-model download-pdfjs download-leaflet compress-assets convert-images pdf-thumbs pdfs watch clean dev archive-gc archive-wayback archive-check

# Source .env for deploy / GitHub config if it exists.
# .env format: KEY=value (one per line, no `export` prefix, no quotes needed).

@@ -43,6 +43,16 @@ build:
	else \
		echo "Photography sidecars skipped: run 'uv sync' to enable EXIF + palette + dimension extraction (build continues with frontmatter only)"; \
	fi
# Archive pipeline (Phase 1): fetch any manifest URL without a local
# artifact, extract text, write archive/<slug>/PROVENANCE.json and
# data/archive-index.json. Gated on .venv, same as embed.py. A SHA or
# slug-URL integrity error exits non-zero and halts the build; a
# transient network failure is non-fatal (the entry retries next build).
	@if [ -d .venv ]; then \
		uv run python tools/archive.py fetch; \
	else \
		echo "Archive fetch skipped: run 'uv sync' to enable link archiving (build continues)"; \
	fi
	cabal run site -- build
	pagefind --site _site
	@if [ -d .venv ]; then \

@@ -153,6 +163,38 @@ watch:
clean:
	cabal run site -- clean

# Evict archived works: delete archive/<slug>/ directories whose slug is
# recorded in archive/removed.yaml. Opt-in: NEVER run by `make build`.
# Orphan directories (not in manifest.yaml, not in removed.yaml) are
# reported, never deleted. See ARCHIVE.md - Eviction & removal.
archive-gc:
	@if [ -d .venv ]; then \
		uv run python tools/archive.py gc; \
	else \
		python3 tools/archive.py gc; \
	fi

# Submit archived URLs to the Wayback Machine and backfill the capture URL
# into each PROVENANCE.json. A slow network job: opt-in, never run by
# `make build`. Always exits 0; an entry without a capture retries next run.
archive-wayback:
	@if [ -d .venv ]; then \
		uv run python tools/archive.py wayback; \
	else \
		python3 tools/archive.py wayback; \
	fi

# Probe every archived URL for link rot, updating data/archive-state.json.
# A slow network job: opt-in, never run by `make build`. Asymmetric
# hysteresis: `rotted` needs 3 consecutive failures over >=14 days; a
# single success recovers immediately. The next build consumes the state.
archive-check:
	@if [ -d .venv ]; then \
		uv run python tools/archive.py check; \
	else \
		python3 tools/archive.py check; \
	fi

# Dev build includes any in-progress drafts under content/drafts/essays/.
# SITE_ENV=dev is read by build/Site.hs; drafts are otherwise invisible to
# every build (make build / make deploy / cabal run site -- build directly).

@@ -0,0 +1,14 @@
{
  "url": "https://cr.yp.to/aes-speed.html",
  "slug": "djb-aes-speed",
  "title": "Cache-timing attacks on AES (cr.yp.to)",
  "type": "html",
  "artifact": "snapshot.html",
  "sha256": "8da2d5aedeccf9f602e1680631aa77308683803c0cc9b04caad52c7a70c60832",
  "previous-sha256": "0a50bf6d64b2ec08771d83be5ef47721ecbfc431e3512ff55978e76f452dbd3f",
  "bytes": 26186,
  "archived": "2026-05-23",
  "source-date": null,
  "snapshot-quality": "ok",
  "wayback": null
}
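
A record like the one above is what makes the integrity check described in the commit message possible. The following is a minimal sketch of such a check, assuming the archive/<slug>/ layout with PROVENANCE.json next to the artifact; `sha256_of` and `verify_entry` are hypothetical helper names, not functions known to exist in tools/archive.py, and only the `artifact`, `bytes`, and `sha256` fields shown in the record are used.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 16) -> str:
    """Hash a file in chunks so large PDFs need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_entry(entry_dir: Path) -> None:
    """Raise ValueError if the artifact disagrees with PROVENANCE.json.

    Hypothetical helper: compares the on-disk artifact against the
    committed "bytes" and "sha256" fields of the provenance record.
    """
    prov = json.loads((entry_dir / "PROVENANCE.json").read_text(encoding="utf-8"))
    artifact = entry_dir / prov["artifact"]
    size = artifact.stat().st_size
    if size != prov["bytes"]:
        raise ValueError(f"{artifact}: size {size} != recorded {prov['bytes']}")
    actual = sha256_of(artifact)
    if actual != prov["sha256"]:
        raise ValueError(f"{artifact}: sha256 {actual} != recorded {prov['sha256']}")
```

A caller would run `verify_entry(Path("archive/djb-aes-speed"))` for each entry and let the exception propagate, so that a mismatch ends the process with a non-zero exit status and halts the build.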
@@ -0,0 +1,470 @@
|
|||
<!-- Saved from https://cr.yp.to/aes-speed.html at 2026-05-23T13:04:33Z using monolith v2.10.1 -->
|
||||
<html><head><meta content="default-src 'none'; img-src data:; style-src 'unsafe-inline'; style-src-elem 'unsafe-inline'; style-src-attr 'unsafe-inline'; font-src data:; script-src 'none'; object-src 'none'; frame-src 'none'" http-equiv="Content-Security-Policy"/><meta content="noindex, noarchive" name="robots"/><link href="data:text/html;base64,PGh0bWw+PGJvZHk+ZmlsZSBkb2VzIG5vdCBleGlzdDwvYm9keT48L2h0bWw+DQo=" rel="icon"/></head><body>
|
||||
<title>AES speed</title>
|
||||
<meta content="aes" name="keywords"/>
|
||||
<a href="https://cr.yp.to/djb.html">D. J. Bernstein</a>
|
||||
<br/><a href="https://cr.yp.to/hash.html">Hash functions and ciphers</a>
|
||||
<h1>AES speed</h1>
|
||||
<b>Update:</b>
|
||||
Peter Schwabe and I now have a paper on this topic:
|
||||
<ul>
|
||||
<li>
|
||||
<a name="aesspeed-paper">[aesspeed]</a>
|
||||
15pp.
|
||||
<a href="https://cr.yp.to/aes-speed/aesspeed-20080926.pdf">(PDF)</a>
|
||||
D. J. Bernstein, Peter Schwabe.
|
||||
New AES software speed records.
|
||||
Document ID: b90c51d2f7eef86b78068511135a231f.
|
||||
URL: https://cr.yp.to/papers.html#aesspeed.
|
||||
Date: 2008.09.26.
|
||||
Supersedes:
|
||||
<a href="https://cr.yp.to/aes-speed/aesspeed-20080908.pdf">(PDF)</a>
|
||||
2008.09.08.
|
||||
</li></ul>
|
||||
The software is now available as part of the
|
||||
<a href="https://cr.yp.to/streamciphers/timings.html#toolkit-estreambench">estreambench</a>
|
||||
toolkit.
|
||||
We have placed the software into the public domain;
|
||||
feel free to integrate it into your own AES applications!
|
||||
<p>
|
||||
Information below this line has not yet been updated.
|
||||
</p><hr/>
|
||||
This document describes various speedups in AES software.
|
||||
This document assumes that
|
||||
the software is going to be used in an application
|
||||
where timing information is <i>not</i> exposed to attackers.
|
||||
<p>
|
||||
The reader is expected to already know the standard structure of AES software:
|
||||
</p><ul>
|
||||
<li>each of the 16 state bytes is used as an index for a table lookup producing a 32-bit word;
|
||||
</li><li>16 xors combine these 16 words and 4 expanded key words into 4 new state words;
|
||||
</li><li>those 4 words are viewed as the starting 16 bytes for the next round.
|
||||
</li></ul>
|
||||
See Section 5.2.1 of "AES Proposal: Rijndael" by Daemen and Rijmen.
|
||||
<h2>Endianness</h2>
|
||||
On a little-endian CPU,
|
||||
extracting the first byte of a 32-bit word
|
||||
is an &0xff arithmetic instruction;
|
||||
on a big-endian CPU,
|
||||
extracting the first byte of a 32-bit word
|
||||
is a >>24 arithmetic instruction.
|
||||
Similar comments apply to the other bytes.
|
||||
<p>
|
||||
One can write AES software
|
||||
that uses arithmetic instructions as if the CPU were little-endian.
|
||||
If the CPU is actually big-endian,
|
||||
the software swaps the bytes of the AES key, input, and output (at run time).
|
||||
The software also swaps the bytes of the table (at compile time),
|
||||
for example by expressing the table as a sequence of 32-bit integers.
|
||||
</p><p>
|
||||
<b>Matched endianness.</b>
|
||||
One can easily eliminate the byte-swapping time for the AES key, input, and output:
|
||||
simply use the appropriate arithmetic instructions
|
||||
for the endianness of the CPU.
|
||||
In this case the table must not be swapped.
|
||||
</p><h2>Table structure</h2>
|
||||
All else being equal, smaller AES tables are faster:
|
||||
they take less time to load into cache and are more likely to stay in cache.
|
||||
Beware that most benchmarking tools preload caches and thus can't see this speedup.
|
||||
<p>
|
||||
Daemen and Rijmen suggest "4 KBytes of tables."
|
||||
There are 4 tables.
|
||||
Each table has 256 words occupying 1024 bytes.
|
||||
The loads are spread evenly across the tables.
|
||||
</p><p>
|
||||
<b>Rotated lookups.</b>
|
||||
Daemen and Rijmen suggest an alternative "with a total table size of 1KByte"
|
||||
but with extra arithmetic.
|
||||
The point is that the tables are rotations of each other:
|
||||
for example,
|
||||
the first word of the first table is (0xc6,0x63,0x63,0xa5),
|
||||
the first word of the second table is (0xa5,0xc6,0x63,0x63),
|
||||
the first word of the third table is (0x63,0xa5,0xc6,0x63),
|
||||
and the first word of the fourth table is (0x63,0x63,0xa5,0xc6).
|
||||
One can store the first table,
|
||||
and simulate a lookup in another table at the cost of an extra rotation.
|
||||
</p><p>
|
||||
<b>Unaligned loads.</b>
|
||||
One can instead use a single 2KB table having 256 8-byte entries
|
||||
such as (0x00,0x63,0xa5,0xc6,0x63,0x63,0xa5,0xc6).
|
||||
There are many reasonable choices of pattern here;
|
||||
what's important is that the pattern includes the desired
|
||||
(0xc6,0x63,0x63,0xa5) and (0xa5,0xc6,0x63,0x63) and so on as substrings.
|
||||
On the Pentium, the PowerPC, et al.,
|
||||
one can load 4-byte words from memory addresses that aren't divisible by 4,
|
||||
and there's no penalty when the word doesn't cross an 8-byte boundary.
|
||||
</p><h2>Masked loads</h2>
|
||||
16 of the 160 table lookups in 10-round AES are masked.
|
||||
The 40 table lookups in 10-round AES key expansion are also masked.
|
||||
The masks are 0x000000ff, 0x0000ff00, 0x00ff0000, and 0xff000000, each used equally often.
|
||||
<p>
|
||||
The simplest way to compute a mask is with an arithmetic instruction: for example, &0xff00.
|
||||
</p><p>
|
||||
<b>Byte loads.</b>
|
||||
One can eliminate 25% of the masks,
|
||||
namely the bottom-byte masks,
|
||||
by combining them with load instructions.
|
||||
All popular CPUs have single-byte-load instructions.
|
||||
</p><p>
|
||||
<b>Two-byte loads.</b>
|
||||
One can eliminate another 25% of the masks
|
||||
on CPUs with two-byte-load instructions.
|
||||
This constrains the table pattern:
|
||||
it's important to have (0x00,0x63) on little-endian CPUs,
|
||||
and (0x63,0x00) on big-endian CPUs.
|
||||
</p><p>
|
||||
<b>Masked tables.</b>
|
||||
One can eliminate all of the masks by precomputing masked tables, using extra table space.
|
||||
The simplest table structure uses a total of 8KB.
|
||||
Two tables, one with entries such as (0x00,0x63,0xa5,0xc6,0x63,0x63,0xa5,0xc6)
|
||||
and another with entries such as (0x00,0x00,0x00,0x00,0x63,0x00,0x00,0x00),
|
||||
use a total of 4KB.
|
||||
In my experience,
|
||||
the cost of larger tables outweighs the benefit of eliminating a few masks.
|
||||
</p><h2>Key expansion</h2>
|
||||
A 4-word (128-bit) key is expanded in 40 steps.
|
||||
Each step produces a new word, totalling 44 words in the expanded key.
|
||||
A step has a byte extraction (see below), a masked load, and two xors.
|
||||
The total work is 40 byte extractions, 40 masked loads, and 80 xors.
|
||||
For comparison, the subsequent work to encrypt a block involves
|
||||
160 byte extractions, 160 loads (of which 16 are masked), and 160 xors.
|
||||
<p>
|
||||
Daemen and Rijmen say (Section 4.3.2)
|
||||
that key expansion involves "almost no computational overhead."
|
||||
Obviously key expansion is less expensive than encrypting a block.
|
||||
On the other hand, the cost of key expansion is still quite noticeable.
|
||||
</p><p>
|
||||
<b>Expanded keys.</b>
|
||||
A typical AES implementation precomputes and stores an expanded key.
|
||||
The 40 byte extractions, 40 masked loads, and 80 xors aren't repeated for every block;
|
||||
they are done only once, along with 44 stores.
|
||||
Each block then involves 44 extra loads for the expanded key.
|
||||
Some stores and loads can be eliminated
|
||||
if many blocks are handled at once
|
||||
and some extra registers are available.
|
||||
</p><p>
|
||||
Long-term storage of an expanded key can slow down applications that handle many keys:
|
||||
the expanded keys take more time to load into cache
|
||||
than the original keys and are less likely to stay in cache.
|
||||
</p><p>
|
||||
<b>Partially expanded keys.</b>
|
||||
An alternative is to precompute and store a partially expanded key,
|
||||
only 14 words instead of 44 words.
|
||||
The partially expanded key consists of words
|
||||
0, 1, 2, 3, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40 from the expanded key.
|
||||
Loading the partially expanded key, and converting it into the fully expanded key,
|
||||
takes only 14 loads and 30 xors.
|
||||
</p><p>
|
||||
One can interpolate between partial expansion and full expansion,
|
||||
using various amounts of storage per key and achieving various balances between load and xor.
|
||||
</p><h2>Index extraction</h2>
|
||||
The 16 xor operations in an AES round
|
||||
produce 4 words in 4 integer registers.
|
||||
The 16 bytes of these words are then extracted and used as indices for the next round.
|
||||
<p>
|
||||
The simplest way to extract 4 bytes is using 6 instructions,
|
||||
namely 3 shifts and 3 bottom-byte extractions:
|
||||
&255;
|
||||
(>>8)&255;
|
||||
(>>16)&255;
|
||||
>>24.
|
||||
</p><p>
|
||||
Using a byte as an index then requires multiplying the byte by a constant
|
||||
that depends on the table structure.
|
||||
Let's assume the 2KB tables described above; then the constant is 8.
|
||||
The multiplications use 4 shifts:
|
||||
<<3;
|
||||
<<3;
|
||||
<<3;
|
||||
<<3.
|
||||
</p><p>
|
||||
<b>Scaled-index loads.</b>
|
||||
Many CPUs can multiply an index register by 8 for free as part of a load.
|
||||
</p><p>
|
||||
<b>Scaled-index extractions.</b>
|
||||
What about CPUs that can't multiply an index register by 8 for free?
|
||||
Two of the multiplications can nevertheless be eliminated,
|
||||
because they can be combined with shifts.
|
||||
The overall extract-and-scale sequence has 8 instructions:
|
||||
(<<3)&2040;
|
||||
(>>5)&2040;
|
||||
(>>13)&2040;
|
||||
(>>21)&2040.
|
||||
The PowerPC has a combined rotate-and-mask instruction,
|
||||
making this sequence take only 4 instructions.
|
||||
</p><p>
|
||||
<b>Scaled tables.</b>
|
||||
One can rotate table entries by 3 bits,
|
||||
reducing the above 8 instructions to 7 instructions.
|
||||
</p><p>
|
||||
<b>Second-byte instructions.</b>
|
||||
The x86 architecture (Pentium, Athlon, etc.)
|
||||
includes a combined (>>8)&255 instruction.
|
||||
This means that extracting 4 bytes takes only 5 instructions:
|
||||
&255;
|
||||
(>>8)&255;
|
||||
>>16;
|
||||
&255;
|
||||
>>8.
|
||||
Alternate 5-instruction sequence:
|
||||
&255;
|
||||
(>>8)&255;
|
||||
>>16;
|
||||
&255;
|
||||
(>>8)&255.
|
||||
</p><p>
|
||||
Of course, the ultimate measure of performance is a cycle count, not an instruction count.
|
||||
Matsui states that the (>>8)&255; instruction is "a bit expensive"
|
||||
on the Pentium 4 Prescott (f33, f34, f41);
|
||||
presumably this means that the instruction takes more cycles than, e.g., a mere &255.
|
||||
But all of the measurements I've seen indicate the opposite.
|
||||
I'm not sure what I'm missing here.
|
||||
</p><p>
|
||||
<b>32-bit shifts on 64-bit architectures.</b>
|
||||
The amd64 architecture (P4E, Athlon 64, Core 2, etc.) can right-shift a 64-bit register,
|
||||
but Matsui comments that this operation is extremely slow on the P4E.
|
||||
It's much better to use the amd64's x86-compatible right-shift instruction;
|
||||
this instruction sets the top 32 bits of its 64-bit input to 0 before shifting.
|
||||
</p><p>
|
||||
<b>Byte extraction via loads.</b>
|
||||
A completely different way to extract 4 bytes is with 1 store and 4 loads.
|
||||
One can mix this with the previous approaches
|
||||
to achieve various balances between load and arithmetic.
|
||||
</p><p>
|
||||
Consider, for example, the UltraSPARC,
|
||||
which has 2 integer units and 1 load/store unit.
|
||||
A traditional sequence of
|
||||
14 partially-expanded-key loads (see below), 30 key-expansion xors,
|
||||
160 scaled-index extractions, 160 table-lookup loads, 160 xors, 16 masks,
|
||||
4 input loads, and 4 output stores
|
||||
occupies a total of 526 integer instructions (at least 263 cycles)
|
||||
and 182 loads (at least 182 cycles).
|
||||
Using loads for some byte extractions,
|
||||
replacing 36 scaled-index extractions with 9 stores and 36 loads,
|
||||
means a total of 454 integer instructions (at least 227 cycles)
|
||||
and 227 loads/stores (at least 227 cycles).
|
||||
</p><h2>Unrolling</h2>
|
||||
A typical 9-iteration AES loop
|
||||
involves 9 increments of a loop index, 9 comparisons, and 9 branches,
|
||||
one of which is mispredicted on most CPUs.
|
||||
The loop index also consumes a register,
|
||||
forcing an extra 9 stores and 9 loads on CPUs that don't have registers to spare.
|
||||
<p>
|
||||
<b>Full unrolling.</b>
|
||||
One can eliminate all of these costs by fully unrolling the loop.
|
||||
Beware, however, that full unrolling costs a few kilobytes of code-cache space.
|
||||
</p><p>
|
||||
<b>Partial unrolling.</b>
|
||||
CPUs are more likely to correctly predict a 4-iteration loop than a 9-iteration loop.
|
||||
</p><h2>Instruction scheduling</h2>
|
||||
The 16 table lookups in an AES round are independent
|
||||
and can be scheduled in many different ways.
|
||||
One can, for example,
|
||||
perform all the table lookups for the first input from bottom byte to top
|
||||
(outputs 0, 3, 2, 1),
|
||||
then perform all the table lookups for the second input from bottom byte to top
|
||||
(outputs 1, 0, 3, 2),
|
||||
then perform all the table lookups for the third input from bottom byte to top
|
||||
(outputs 2, 1, 0, 3),
|
||||
then perform all the table lookups for the fourth input from bottom byte to top
|
||||
(outputs 3, 2, 1, 0).
|
||||
One can, as another example,
|
||||
first perform all the table lookups for the first output in order of the inputs,
|
||||
then perform all the table lookups for the second output in order of the inputs,
|
||||
etc.
|
||||
<p>
|
||||
<b>Maximum parallelism.</b>
|
||||
The overall depth of the AES round is
|
||||
one byte extraction plus one table lookup plus two xors:
|
||||
a mythical CPU offering extensive parallelism
|
||||
could perform all sixteen byte extractions in parallel,
|
||||
then all sixteen table lookups in parallel,
|
||||
then eight xors in parallel,
|
||||
then four xors in parallel.
|
||||
Note that each output is obtained by xor'ing two parallel xor's,
|
||||
rather than by three serial xor's.
|
||||
</p><p>
|
||||
<b>Deferring loads.</b>
|
||||
The amd64 architecture poses several challenges to AES instruction scheduling.
|
||||
First,
|
||||
most integer instructions require the output register to be one of the input registers.
|
||||
Second,
|
||||
typical amd64 CPUs handle a load and xor most efficiently as a unified load-xor,
|
||||
but a unified load-xor gives no opportunity to switch registers.
|
||||
Third,
|
||||
only 4 registers (eax, ebx, ecx, edx) allow second-byte instructions.
|
||||
</p><p>
|
||||
Matsui concludes that, on amd64 (and x86),
|
||||
keeping each round's inputs y0, y1, y2, y3 and outputs z0, z1, z2, z3 in eax, ebx, ecx, edx,
|
||||
to allow second-byte instructions,
|
||||
is "impossible without saving/restoring."
|
||||
But that's incorrect.
|
||||
No extra copies are required.
|
||||
A careful instruction sequence
|
||||
uses the minimal conceivable number of instructions:
|
||||
20 for byte extraction,
|
||||
16 for table lookups,
|
||||
and 4 for handling the expanded key.
|
||||
The idea is to extract all the bytes from an input,
|
||||
freeing the input's register for an output,
|
||||
before doing any table lookups involving that output:
|
||||
</p><ul>
|
||||
<li>Extract the 4 bytes from y0.
|
||||
At this point y1, y2, y3, and the 4 bytes are live.
|
||||
</li><li>Feed 1 byte into z0.
|
||||
At this point y1, y2, y3, z0, and 3 more bytes are live.
|
||||
</li><li>Extract the 4 bytes from y1, immediately feeding 1 into z0.
|
||||
At this point y2, y3, z0, and 6 more bytes are live.
|
||||
</li><li>Feed 2 bytes into z1.
|
||||
At this point y2, y3, z0, z1, and 4 more bytes are live.
|
||||
</li><li>Extract the 4 bytes from y2, immediately feeding 2 into z0 and z1.
|
||||
At this point y3, z0, z1, and 6 more bytes are live.
|
||||
</li><li>Feed 3 bytes into z2.
|
||||
At this point y3, z0, z1, z2, and 3 more bytes are live.
|
||||
</li><li>Extract the 4 bytes from y3, immediately feeding 3 into z0, z1, and z2.
|
||||
At this point z0, z1, z2, and 4 more bytes are live.
|
||||
</li><li>Feed 4 bytes into z3.
|
||||
At this point z0, z1, z2, and z3 are live.
|
||||
</li><li>Handle 4 words of the expanded key.
|
||||
</li></ul>
|
||||
The maximum number of live registers here is 9,
|
||||
fitting easily into the amd64 instruction set.
|
||||
<p>
|
||||
<b>Squeezing inputs and outputs into 7 32-bit registers.</b>
|
||||
The x86 architecture poses an additional challenge to AES instruction scheduling:
|
||||
there are only 7 general-purpose integer registers.
|
||||
</p><p>
|
||||
It's still possible to handle a round with 0 stores, 4 expanded-key loads,
|
||||
and 16 loads for table lookups.
|
||||
The shortest instruction sequence that I know has a total of 46 instructions,
|
||||
6 more than what would be possible with extra registers;
|
||||
1 of the 46 instructions can be eliminated if the key expansion is changed.
|
||||
</p><p>
|
||||
The idea of this instruction sequence
|
||||
is to rotate y0 by 16 bits,
|
||||
use the bottom two bytes of both y0 and y2,
|
||||
and then merge the remaining four bytes of y0 and y2 into a single register
|
||||
(for example, shifting y0 down 16 bits, masking y1, and adding the results),
|
||||
freeing a register at the cost of 3 extra instructions (the rotate, the mask, and the add);
|
||||
splitting 3 load-xor instructions into 3 loads and 3 xors
|
||||
then easily puts all outputs into suitable registers.
|
||||
The rotation can be eliminated if the expanded-key word that corresponds to y0
|
||||
is rotated by 16 bits.
|
||||
</p><h2>Speed reports</h2>
|
||||
Speed reports vary in whether they use CTR, CBC, etc.,
|
||||
and in the exact rules for measuring speeds.
|
||||
The "eSTREAM" cycles/byte counts are
|
||||
for counter-mode AES measured by the eSTREAM benchmarking toolkit;
|
||||
future implementors are encouraged to support the eSTREAM interface for direct comparability.
|
||||
<table border="">
|
||||
<tbody><tr><th>Architecture</th><th>CPU</th><th>eSTREAM cycles/byte</th><th>Ad-hoc cycles/byte</th><th>Software</th></tr>
|
||||
<tr><td>amd64</td><td>Intel Core 2 Duo (6f6)?</td><td></td><td>9.2</td><td>Matsui/Nakajima (CHES 2007)</td></tr>
|
||||
<tr><td>amd64</td><td>AMD Athlon 64 (15,75,2)?</td><td></td><td>10.625 (170/block)</td><td>Matsui (FSE 2006)</td></tr>
|
||||
<tr><td>amd64</td><td>AMD Athlon 64 (15,75,2)?</td><td></td><td>12.4375 (199/block)</td><td>Lipmaa</td></tr>
|
||||
<tr><td>amd64</td><td>Intel Core 2 Duo (6f6); katana</td><td>12.56</td><td></td><td>hongjun/v1/1</td></tr>
|
||||
<tr><td>amd64</td><td>Intel Core 2 Quad Q6600 (6fb); latour</td><td>12.57</td><td></td><td>hongjun/v1/1</td></tr>
|
||||
<tr><td>amd64</td><td>AMD Athlon 64 (15,75,2)?</td><td></td><td>13.125 (210/block)</td><td>Osvik</td></tr>
|
||||
<tr><td>amd64</td><td>AMD Athlon 64 X2 (15,75,2); mace</td><td>13.32</td><td></td><td>hongjun/v1/1</td></tr>
|
||||
<tr><td>amd64</td><td>AMD Opteron 240 (f58); nmisles8amd64</td><td>13.45</td><td></td><td>bernstein/amd64-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium III (68a)?</td><td></td><td>14 (224/block)</td><td>Osvik</td></tr>
|
||||
<tr><td>x86</td><td>AMD Athlon (622)?</td><td></td><td>14.0625 (225/block)</td><td>Osvik</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium III (68a)?</td><td></td><td>14.125 (226/block)</td><td>Lipmaa</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium 4 (f12)?</td><td></td><td>15 (240/block)</td><td>Osvik</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium 4 (f12)?</td><td></td><td>15.875 (254/block)</td><td>Lipmaa</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium M (695); whisper</td><td>15.96</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>amd64</td><td>Intel Pentium 4 (f64)?</td><td></td><td>16 (256/block)</td><td>Matsui (FSE 2006)</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium III (68a)?</td><td></td><td>16.25 (260/block)</td><td>Gladman</td></tr>
|
||||
<tr><td>amd64</td><td>Intel Pentium D (f64); nmi0161</td><td>16.74</td><td></td><td>bernstein/amd64-2/1</td></tr>
|
||||
<tr><td>amd64</td><td>Intel Pentium D (f64); svlin001</td><td>16.75</td><td></td><td>bernstein/amd64-2/1</td></tr>
|
||||
<tr><td>amd64</td><td>Intel Xeon (f41); nmi0056</td><td>16.75</td><td></td><td>bernstein/amd64-2/1</td></tr>
|
||||
<tr><td>amd64</td><td>Intel Xeon (f4a); nmi0090</td><td>16.77</td><td></td><td>bernstein/amd64-2/1</td></tr>
|
||||
<tr><td>sparc</td><td>Sun UltraSPARC III</td><td></td><td>16.875 (270/block)</td><td>Lipmaa</td></tr>
|
||||
<tr><td>amd64</td><td>Intel Xeon (f41); nmi0057</td><td>16.89</td><td></td><td>bernstein/amd64-2/1</td></tr>
|
||||
<tr><td>amd64</td><td>Intel Pentium D (f64); speed</td><td>16.90</td><td></td><td>bernstein/amd64-2/1</td></tr>
|
||||
<tr><td>amd64</td><td>Intel Pentium D (f64); nmi0104</td><td>16.90</td><td></td><td>bernstein/amd64-2/1</td></tr>
|
||||
<tr><td>amd64</td><td>Intel Pentium D (f64); nmi0241</td><td>16.93</td><td></td><td>bernstein/amd64-2/1</td></tr>
|
||||
<tr><td>ppc64</td><td>IBM POWER5; nmi0154</td><td>16.93</td><td></td><td>bernstein/big-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium 4 (f24); nmi0086</td><td>16.96</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium 4 (f12); fireball</td><td>16.98</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium 4 (f24); nmitest4</td><td>17.01</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>ppc64</td><td>IBM PowerPC G5 970; nmi0048</td><td>17.17</td><td></td><td>bernstein/big-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium 2 (652); boris</td><td>17.33</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium 3 (68a)</td><td>17.49</td><td></td><td>Bernstein aes-128/x86-mmx-1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium 3 (672); orpheus</td><td>17.55</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium M (6d8)</td><td>17.57</td><td></td><td>Wu v0/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium 4 (f33)?</td><td></td><td>17.75 (284/block)</td><td>Matsui/Fukuda (FSE 2005)</td></tr>
|
||||
<tr><td>x86</td><td>Intel Xeon (f29); nmibuild40</td><td>17.79</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Xeon (f27); nmi0059</td><td>17.79</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Xeon (f25); nmibuild16</td><td>17.79</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Xeon (f25); nmi0013</td><td>17.79</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Xeon (f29); nmi0059</td><td>17.80</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Xeon (f29); nmibuild17</td><td>17.81</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Xeon (f25); nmibuild15</td><td>17.82</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Xeon (f25); nmibuild26</td><td>17.83</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Xeon (f25); nmibuild21</td><td>17.83</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Xeon (f25); nmi0036</td><td>17.84</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Xeon (f25); nmibuild22</td><td>17.84</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>AMD Athlon (622); thoth</td><td>18.38</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>ppc32</td><td>IBM POWER4; nmibuild14</td><td>18.55</td><td></td><td>bernstein/little-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Xeon (f41); nmi0079</td><td>18.88</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Xeon (f41); nmi0062</td><td>18.89</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>amd64</td><td>Intel Core 2 Duo (6f6)</td><td></td><td>18.9</td><td>OpenSSL 0.9.8e</td></tr>
|
||||
<tr><td>x86</td><td>Intel Xeon (f41); nmi0061</td><td>18.91</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium 4 (f41); svlin002</td><td>18.94</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Xeon (f41); nmi0076</td><td>18.96</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Xeon (f4a); nmi0102</td><td>18.97</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Xeon (f41); nmi0060</td><td>18.97</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Xeon (f41); nmi0063</td><td>18.95</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium 3 (68a)</td><td>19.06</td><td></td><td>Wu v1/1</td></tr>
|
||||
<tr><td>ppc32</td><td>Motorola PowerPC G4 7410; gggg</td><td>19.11</td><td></td><td>bernstein/big-1/1</td></tr>
|
||||
<tr><td>amd64</td><td>Intel Core 2 Duo (6f6)</td><td></td><td>19.5</td><td>OpenSSL 0.9.8a</td></tr>
|
||||
<tr><td>x86</td><td>AMD Athlon (622)?</td><td></td><td>19.9375 (319/block)</td><td>Lipmaa</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium 1 (52c)</td><td></td><td>20 (320/block)</td><td>Lipmaa</td></tr>
|
||||
<tr><td>sparc</td><td>Sun UltraSPARC III</td><td>20.75</td><td></td><td>Bernstein big-1/1</td></tr>
|
||||
<tr><td>amd64</td><td>AMD Athlon 64 (15,75,2)</td><td></td><td>20.9</td><td>OpenSSL 0.9.8e</td></tr>
|
||||
<tr><td>ppc32</td><td>Motorola PowerPC G4 7400; nmi0042</td><td>20.92</td><td></td><td>bernstein/big-1/1</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium M (6d8)</td><td></td><td>21</td><td>OpenSSL 0.9.8a</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium D (f47); shell</td><td>21.58</td><td></td><td>bernstein/x86-mmx-1/1</td></tr>
|
||||
<tr><td>x86</td><td>AMD Athlon (622)</td><td></td><td>22</td><td>OpenSSL 0.9.8a</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium 4 (f29)</td><td></td><td>22</td><td>OpenSSL 0.9.8b</td></tr>
|
||||
<tr><td>amd64</td><td>AMD Athlon 64 (15,75,2)?</td><td></td><td>23.5</td><td>OpenSSL 0.9.7e</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium 4 (f41)</td><td></td><td>23.5</td><td>OpenSSL 0.9.8a</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium 3 (672); orpheus</td><td></td><td>23.62</td><td>OpenSSL 0.9.8e</td></tr>
|
||||
<tr><td>ppc32</td><td>Motorola PowerPC G4 7410</td><td></td><td>24.0625 (385/block)</td><td>Ahrens</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium 4 (f12)</td><td></td><td>24.4</td><td>OpenSSL 0.9.8a</td></tr>
|
||||
<tr><td>sparc</td><td>Sun UltraSPARC III</td><td></td><td>25</td><td>OpenSSL</td></tr>
|
||||
<tr><td>ppc32</td><td>Motorola PowerPC G4 7410</td><td></td><td>25.0625 (401/block)</td><td>Ahrens</td></tr>
|
||||
<tr><td>x86</td><td>Intel Core Duo; nmi0068</td><td>25.74</td><td></td><td>gladman/1</td></tr>
|
||||
<tr><td>amd64</td><td>Intel Pentium D (f64); speed</td><td></td><td>27.33</td><td>OpenSSL 0.9.8e</td></tr>
|
||||
<tr><td>ppc32</td><td>Motorola PowerPC G4 7410; gggg</td><td></td><td>29.32</td><td>OpenSSL 0.9.8c</td></tr>
|
||||
<tr><td>sparcv9</td><td>Sun UltraSPARC III; nmi0051</td><td>29.45</td><td></td><td>bernstein/big-1/1</td></tr>
|
||||
<tr><td>sparcv9</td><td>Sun UltraSPARC III; nmisolaris10</td><td>29.46</td><td></td><td>bernstein/big-1/1</td></tr>
|
||||
<tr><td>ppc64</td><td>IBM Cell PPE; nmips3</td><td>35.20</td><td></td><td>bernstein/big-1/1</td></tr>
|
||||
<tr><td>amd64</td><td>Intel Pentium 4 (f64)</td><td></td><td>37</td><td>OpenSSL 0.9.7f</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium 4 (f29)</td><td></td><td>39</td><td>OpenSSL 0.9.7e</td></tr>
|
||||
<tr><td>sparc</td><td>Sun UltraSPARC III</td><td></td><td>46.875 (750/block)</td><td>Bassham</td></tr>
|
||||
<tr><td>x86</td><td>Intel Pentium 1 (52c); cruncher</td><td>38.20</td><td></td><td>hongjun/v1/1</td></tr>
|
||||
</tbody></table>
|
||||
<p>
|
||||
Regarding amd64 Intel Pentium 4,
|
||||
Matsui writes:
|
||||
"The number of memory reads
|
||||
for one block encryption of AES
|
||||
is 4 (for plaintext loads)
|
||||
+ 11 x 4 (for subkey loads)
|
||||
+ 16 x 10 (for table lookups)
|
||||
= 208,
|
||||
which means that Pentium 4 takes at least 208 cycles/block for one block encryption."
|
||||
But this lower bound ignores the possibility of loading partially expanded keys,
|
||||
saving as many as 30 loads,
|
||||
and using 64-bit loads for keys and plaintext,
|
||||
saving 9 more loads.
|
||||
</p><p>
|
||||
Regarding amd64 AMD Athlon 64,
|
||||
Matsui writes:
|
||||
"Considering an instruction latency of Athlon 64, the theoretical limit of AES
|
||||
performance on this processor seems around 16 cycles/round = 160 cycles/block.
|
||||
Our result is hence reaching closely this limit."
|
||||
|
||||
|
||||
</p></body></html>
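The load counts quoted in the snapshot's closing paragraphs can be checked with a few lines of arithmetic. This is only an illustrative sketch: the variable names are invented here, and the 169 figure is simply Matsui's 208 minus the two savings the page mentions (up to 30 loads and 9 loads), not a number the page itself states.

```python
# Matsui's per-block memory-read count for AES on the Pentium 4, as quoted.
plaintext_loads = 4
subkey_loads = 11 * 4      # "11 x 4 (for subkey loads)"
table_lookups = 16 * 10    # "16 x 10 (for table lookups)"
matsui_bound = plaintext_loads + subkey_loads + table_lookups

# Savings the page says the bound ignores (both are upper limits in the text).
partial_key_saving = 30    # loading partially expanded keys
wide_load_saving = 9       # 64-bit loads for keys and plaintext
reduced_loads = matsui_bound - partial_key_saving - wide_load_saving

# The Athlon 64 estimate: 16 cycles/round over 10 rounds.
athlon_limit = 16 * 10

print(matsui_bound, reduced_loads, athlon_limit)  # 208 169 160
```

So the objection amounts to saying the load count could be as low as 169 rather than 208 before any cycle-level argument is made.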
|
||||
|
|
@@ -0,0 +1,28 @@
|
|||
# archive/manifest.yaml — curated list of works to preserve.
|
||||
# Edited by hand. Tools never write to this file. See ARCHIVE.md.
|
||||
#
|
||||
# Per-artifact cap: 25 MB. Above that, archive.py warns and skips the fetch;
|
||||
# commit an oversize artifact deliberately with `git add -f`.
|
||||
#
|
||||
# To evict an entry, see archive/removed.yaml — record there FIRST, then
|
||||
# delete the line here, then run `make archive-gc`.
|
||||
|
||||
- url: "https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.203.pdf"
|
||||
slug: nist-fips-203
|
||||
title: "FIPS 203 — Module-Lattice-Based Key-Encapsulation Mechanism Standard"
|
||||
type: pdf
|
||||
tags: [research]
|
||||
note: >
|
||||
The ML-KEM standard. Cited in the SIMD / post-quantum systems work;
|
||||
archived so the citation survives any future reorganization of the
|
||||
NIST publications site.
|
||||
|
||||
- url: "https://cr.yp.to/aes-speed.html"
|
||||
slug: djb-aes-speed
|
||||
title: "AES speed (cr.yp.to)"
# type: html is auto-detected from the .html extension; no override needed.
tags: [research]
note: >
Bernstein's AES speed-comparison page, cited in the SIMD work. The
|
||||
Phase 2 bootstrap entry: a stable, JavaScript-free static page, so its
|
||||
monolith snapshot is reproducible and classifies cleanly as `ok`.
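The manifest comments mention two rules that the fetch step applies: the artifact type is inferred from the URL extension unless `type:` overrides it, and artifacts above the 25 MB cap are skipped. The sketch below shows one way those rules could look. It is not taken from `tools/archive.py`: `detect_type`, `within_cap`, and `CAP_BYTES` are hypothetical names, treating every non-`.pdf` URL as HTML is an assumption, and so is reading "25 MB" as 25 MiB.

```python
from urllib.parse import urlparse

# Assumption: the "25 MB" cap is read as 25 MiB.
CAP_BYTES = 25 * 1024 * 1024


def detect_type(url, override=None):
    """Hypothetical helper: an explicit `type:` wins, else go by extension."""
    if override:
        return override
    path = urlparse(url).path.lower()
    # Assumption: anything that is not a .pdf is snapshotted as HTML.
    return "pdf" if path.endswith(".pdf") else "html"


def within_cap(size_bytes):
    """True when an artifact of this size would be fetched rather than skipped."""
    return size_bytes <= CAP_BYTES
```

Under these assumptions the two bootstrap entries classify as `pdf` and `html` respectively, and the 1,252,341-byte FIPS 203 PDF is well inside the cap.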
|
||||
|
|
@@ -0,0 +1,14 @@
|
|||
{
|
||||
"url": "https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.203.pdf",
|
||||
"slug": "nist-fips-203",
|
||||
"title": "FIPS 203 — Module-Lattice-Based Key-Encapsulation Mechanism Standard",
|
||||
"type": "pdf",
|
||||
"artifact": "document.pdf",
|
||||
"sha256": "fe1f12f32a7e44ec9fdebbf400cda843a40b506dee676725234dc6f7923b6cac",
|
||||
"previous-sha256": null,
|
||||
"bytes": 1252341,
|
||||
"archived": "2026-05-22",
|
||||
"source-date": null,
|
||||
"snapshot-quality": "ok",
|
||||
"wayback": "http://web.archive.org/web/20260515100505/https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.203.pdf"
|
||||
}
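The `sha256` and `artifact` fields above are what the integrity check compares against the file on disk. A minimal sketch of that comparison follows, assuming only the two field names visible in this PROVENANCE.json. The function names `sha256_of` and `verify_entry` are hypothetical and are not the actual `archive.py` API.

```python
import hashlib
import json
import os


def sha256_of(path, chunk=1 << 20):
    """Stream a file through SHA-256 and return the lowercase hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def verify_entry(entry_dir):
    """Hypothetical check: does the artifact still match its recorded hash?"""
    with open(os.path.join(entry_dir, "PROVENANCE.json"), encoding="utf-8") as f:
        prov = json.load(f)
    actual = sha256_of(os.path.join(entry_dir, prov["artifact"]))
    return actual == prov["sha256"]
```

A caller would halt on a `False` result, which mirrors the build-side behaviour described for `verifyArtifactSha`.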
|
||||
Binary file not shown.
|
|
@@ -0,0 +1,19 @@
|
|||
# archive/removed.yaml — record of evicted archive entries.
|
||||
#
|
||||
# Append an entry here BEFORE deleting its line from manifest.yaml, then
|
||||
# run `make archive-gc`. The GC deletes only archive/<slug>/ directories
|
||||
# whose slug is recorded here; an orphaned directory absent from this file
|
||||
# is reported, never deleted. See ARCHIVE.md § Eviction & removal.
|
||||
#
|
||||
# Schema (all fields but `note` required):
|
||||
# url: original URL at time of removal
|
||||
# slug: the archive/<slug>/ directory archive-gc may delete
|
||||
# removed: ISO date of removal
|
||||
# reason: takedown | author-request | legal | quality
|
||||
# note: optional free-text context
|
||||
#
|
||||
# This is not a hostile-tracking list — it exists so GC knows what is safe
|
||||
# to delete, re-adding a removed URL is surfaced loudly, and the link-rot
|
||||
# scanner and `archive-suggest` skip removed works.
|
||||
|
||||
[]
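The deletion rule in the header comment (delete only directories recorded in removed.yaml, report any other orphan) can be expressed as a small pure function. This is a sketch under stated assumptions, not the `make archive-gc` implementation: `plan_gc` is a hypothetical name, and leaving a slug alone while it is still in the manifest is an assumption drawn from the "record first, then delete the manifest line" ordering.

```python
def plan_gc(archive_dirs, manifest_slugs, removed_slugs):
    """Split archive/<slug>/ directories into (to_delete, to_report).

    A directory is deleted only when its slug is recorded in removed.yaml.
    An orphan that is absent from removed.yaml is reported, never deleted.
    """
    delete, report = [], []
    for slug in sorted(archive_dirs):
        if slug in manifest_slugs:
            continue  # assumption: a live manifest entry is never touched
        if slug in removed_slugs:
            delete.append(slug)
        else:
            report.append(slug)
    return delete, report
```

With directories `a`, `b`, `c`, a manifest holding `a`, and removed.yaml holding `b`, the plan deletes `b` and reports `c`.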
|
||||
|
|
@@ -0,0 +1,579 @@
|
|||
{-# LANGUAGE GHC2021 #-}
|
||||
{-# LANGUAGE OverloadedStrings #-}
|
||||
-- | Archive section — the link-archiving system. Phases 1-2: PDF and HTML.
|
||||
--
|
||||
-- Authored input: archive/manifest.yaml (one line per archived link)
|
||||
-- Generated, committed: archive/<slug>/{document.pdf | snapshot.html}
|
||||
-- + PROVENANCE.json
|
||||
-- Generated, gitignored: archive/<slug>/{document,snapshot}.txt
|
||||
-- + data/archive-index.json
|
||||
--
|
||||
-- @tools/archive.py fetch@ runs before the Hakyll build: it downloads
|
||||
-- PDFs / snapshots HTML pages with @monolith@, extracts text, and writes
|
||||
-- each PROVENANCE.json. This module then routes the artifacts and renders
|
||||
-- one @/archive/<slug>/@ page per entry plus the @/archive/@ index.
|
||||
--
|
||||
-- An entry that was never fetched (no PROVENANCE.json) is skipped and
-- yields no page; a PROVENANCE.json without its artifact is fatal. An
|
||||
-- orphaned @archive/<slug>/@ directory with no manifest line is inert
|
||||
-- (no page, not deployed). Artifact-integrity (SHA-256) verification
|
||||
-- runs on both sides: @archive.py fetch@ re-hashes before the Hakyll
|
||||
-- build, and 'verifyArtifactSha' (below) re-hashes again in
|
||||
-- 'loadArchiveEntries' — so the guarantee holds even when @archive.py@
|
||||
-- does not run first (no @.venv@, a direct @cabal run site -- build@,
|
||||
-- or a deploy host without the Python toolchain).
|
||||
--
|
||||
-- See @ARCHIVE.md@ at the repo root for the full design and phase plan.
|
||||
module Archive (archiveRules, archiveBuildStats) where
|
||||
|
||||
import Control.Exception (SomeException, catch)
|
||||
import Control.Monad (filterM, forM, when)
|
||||
import Data.Function (on)
|
||||
import Data.List (groupBy, intercalate, sort, sortBy)
|
||||
import qualified Data.Map.Strict as Map
|
||||
import Data.Maybe (catMaybes, fromMaybe)
|
||||
import Data.Ord (Down (..), comparing)
|
||||
import qualified Data.Set as Set
|
||||
import qualified Data.Text as T
|
||||
import Data.Time (Day, diffDays, fromGregorian,
|
||||
getCurrentTime, utctDay)
|
||||
import qualified Data.Aeson as A
|
||||
import Data.Aeson ((.:), (.:?))
|
||||
import qualified Data.Yaml as Y
|
||||
import System.Directory (doesDirectoryExist, doesFileExist,
|
||||
listDirectory)
|
||||
import System.Exit (exitFailure)
|
||||
import System.IO (hPutStrLn, readFile', stderr)
|
||||
import System.Process (readProcess)
|
||||
import Text.Read (readMaybe)
|
||||
import Hakyll
|
||||
import Contexts (siteCtx)
|
||||
import Backlinks (referencedByField)
|
||||
import SimilarLinks (similarLinksField)
|
||||
import ArchiveIndex (ArchiveStatus (..), statusName,
|
||||
archiveStatusForSlug, normalizeUrl)
|
||||
|
||||
-- ---------------------------------------------------------------------------
|
||||
-- Data model
|
||||
-- ---------------------------------------------------------------------------
|
||||
|
||||
-- | One authored entry in @archive/manifest.yaml@ — only the fields this
|
||||
-- module consumes. @title:@, @type:@ and @tags:@ are read by
|
||||
-- @tools/archive.py@ (title and type fold into PROVENANCE.json; tags are
|
||||
-- Phase 4) and need no Haskell-side binding.
|
||||
data ManifestEntry = ManifestEntry
|
||||
{ meUrl :: String
|
||||
, meNote :: Maybe String
|
||||
, mePaywalled :: Bool
|
||||
, meVisibility :: String -- ^ "public" (default) | "private"
|
||||
}
|
||||
|
||||
instance A.FromJSON ManifestEntry where
|
||||
parseJSON = A.withObject "ManifestEntry" $ \o -> do
|
||||
url <- o .: "url"
|
||||
note <- o .:? "note"
|
||||
paywalled <- fromMaybe False <$> o .:? "paywalled"
|
||||
visibility <- fromMaybe "public" <$> o .:? "visibility"
|
||||
-- A publication/privacy field must fail closed: an unknown value
|
||||
-- (e.g. a typo'd "privte") would otherwise be treated as public
|
||||
-- and publish an artifact the author intended to keep offline.
|
||||
when (visibility `notElem` ["public", "private"]) $ fail $
|
||||
"manifest entry " ++ url
|
||||
++ ": visibility must be \"public\" or \"private\", got "
|
||||
++ show visibility
|
||||
return (ManifestEntry url note paywalled visibility)
|
||||
|
||||
newtype RemovedEntry = RemovedEntry { reUrl :: String }
|
||||
|
||||
instance A.FromJSON RemovedEntry where
|
||||
parseJSON = A.withObject "RemovedEntry" $ \o ->
|
||||
RemovedEntry <$> o .: "url"
|
||||
|
||||
-- | One generated @archive/<slug>/PROVENANCE.json@ — the immutable
|
||||
-- record of an archival event, written by @tools/archive.py@.
|
||||
data Provenance = Provenance
|
||||
{ pvUrl :: String
|
||||
, pvSlug :: String
|
||||
, pvTitle :: String
|
||||
, pvType :: String -- ^ "pdf" | "html"
|
||||
, pvArtifact :: String -- ^ "document.pdf" | "snapshot.html"
|
||||
, pvSha256 :: String
|
||||
, pvBytes :: Integer
|
||||
, pvArchived :: String
|
||||
, pvQuality :: String -- ^ "ok" | "degraded" | "js-required"
|
||||
, pvWayback :: Maybe String
|
||||
}
|
||||
|
||||
instance A.FromJSON Provenance where
|
||||
parseJSON = A.withObject "Provenance" $ \o -> Provenance
|
||||
<$> o .: "url"
|
||||
<*> o .: "slug"
|
||||
<*> o .: "title"
|
||||
<*> o .: "type"
|
||||
<*> o .: "artifact"
|
||||
<*> o .: "sha256"
|
||||
<*> o .: "bytes"
|
||||
<*> o .: "archived"
|
||||
<*> (fromMaybe "ok" <$> o .:? "snapshot-quality")
|
||||
<*> o .:? "wayback"
|
||||
|
||||
-- | A renderable archive entry: the authored manifest line joined with
|
||||
-- its generated provenance and extracted full text. @aeTextId@ is the
|
||||
-- on-disk path of the extracted-text sidecar when it exists (it is
|
||||
-- gitignored, so a no-@.venv@ build may lack it).
|
||||
data ArchiveEntry = ArchiveEntry
|
||||
{ aeManifest :: ManifestEntry
|
||||
, aeProv :: Provenance
|
||||
, aeFulltext :: String
|
||||
, aeTextId :: Maybe FilePath
|
||||
, aeStatus :: ArchiveStatus -- ^ link-rot status of the original
|
||||
}
|
||||
|
||||
-- | The extracted-text sidecar name for an artifact type.
|
||||
textFileFor :: Provenance -> String
|
||||
textFileFor pv
|
||||
| pvType pv == "html" = "snapshot.txt"
|
||||
| otherwise = "document.txt"
|
||||
|
||||
-- | True for a @visibility: private@ entry — kept in-repo as a local
|
||||
-- preservation copy, but its artifact is never routed to @_site/@ and
|
||||
-- its extracted text is never rendered into the page.
|
||||
isPrivate :: ArchiveEntry -> Bool
|
||||
isPrivate = (== "private") . meVisibility . aeManifest
|
||||
|
||||
-- ---------------------------------------------------------------------------
|
||||
-- Rule-generation-time IO (runs inside 'preprocess')
|
||||
-- ---------------------------------------------------------------------------
|
||||
|
||||
manifestPath, removedPath :: FilePath
|
||||
manifestPath = "archive/manifest.yaml"
|
||||
removedPath = "archive/removed.yaml"
|
||||
|
||||
-- | Read @archive/manifest.yaml@. An absent file yields an empty list
|
||||
-- (the archive degrades to invisible, matching the @.venv@-gated
|
||||
-- silent-skip convention). A *parse error on a present file* halts the
|
||||
-- build: the file exists but is broken — degrading to invisible would
|
||||
-- swallow real errors like a typo'd @visibility@ value or a malformed
|
||||
-- entry, both of which are publication-relevant.
|
||||
readManifest :: IO [ManifestEntry]
|
||||
readManifest = do
|
||||
exists <- doesFileExist manifestPath
|
||||
if not exists
|
||||
then return []
|
||||
else do
|
||||
parsed <- Y.decodeFileEither manifestPath
|
||||
case parsed of
|
||||
Right es -> return es
|
||||
Left e -> do
|
||||
hPutStrLn stderr $
|
||||
"[archive] FATAL: manifest.yaml: " ++ show e
|
||||
exitFailure
|
||||
|
||||
readRemovedUrls :: IO (Set.Set T.Text)
|
||||
readRemovedUrls = do
|
||||
exists <- doesFileExist removedPath
|
||||
if not exists
|
||||
then return Set.empty
|
||||
else do
|
||||
parsed <- Y.decodeFileEither removedPath
|
||||
case parsed of
|
||||
Right entries -> return . Set.fromList $
|
||||
map (normalizeUrl . T.pack . reUrl) (entries :: [RemovedEntry])
|
||||
Left e -> do
|
||||
hPutStrLn stderr $
|
||||
"[archive] FATAL: removed.yaml: " ++ show e
|
||||
exitFailure
|
||||
|
||||
validateManifestEntries :: [ManifestEntry] -> Set.Set T.Text -> IO ()
|
||||
validateManifestEntries manifest removed = go Map.empty manifest
|
||||
where
|
||||
go _ [] = return ()
|
||||
go seen (entry : rest) = do
|
||||
let url = meUrl entry
|
||||
norm = normalizeUrl (T.pack url)
|
||||
when (norm `Set.member` removed) $ do
|
||||
hPutStrLn stderr $
|
||||
"[archive] FATAL: manifest URL " ++ show url
|
||||
++ " is also recorded in removed.yaml; refusing to publish "
|
||||
++ "a deliberately removed work."
|
||||
exitFailure
|
||||
case Map.lookup norm seen of
|
||||
Just prior -> do
|
||||
hPutStrLn stderr $
|
||||
"[archive] FATAL: manifest URLs " ++ show prior ++ " and "
|
||||
++ show url ++ " normalise to the same archive target."
|
||||
exitFailure
|
||||
Nothing -> go (Map.insert norm url seen) rest
|
||||
|
||||
-- | Scan @archive/<slug>/PROVENANCE.json@ into a @url -> (slug, Provenance)@
|
||||
-- map. The directory name is the slug; the join key is the URL.
|
||||
readProvenances :: IO (Map.Map String (String, Provenance))
|
||||
readProvenances = do
|
||||
exists <- doesDirectoryExist "archive"
|
||||
if not exists
|
||||
then return Map.empty
|
||||
else do
|
||||
names <- listDirectory "archive"
|
||||
entries <- forM names $ \name -> do
|
||||
let provPath = "archive/" ++ name ++ "/PROVENANCE.json"
|
||||
isFile <- doesFileExist provPath
|
||||
if not isFile
|
||||
then return Nothing
|
||||
else do
|
||||
decoded <- A.eitherDecodeFileStrict' provPath
|
||||
case decoded of
|
||||
Right p -> return (Just (pvUrl p, (name, p)))
|
||||
Left e -> do
|
||||
hPutStrLn stderr $
|
||||
"[archive] FATAL: " ++ provPath ++ ": " ++ show e
|
||||
exitFailure
|
||||
return (Map.fromList (catMaybes entries))
|
||||
|
||||
-- | Read a file, returning "" on any error (e.g. an absent text sidecar).
|
||||
readFileSafe :: FilePath -> IO String
|
||||
readFileSafe path =
|
||||
catch (readFile' path) (\(_ :: SomeException) -> return "")
|
||||
|
||||
-- | Verify a committed artifact's SHA-256 against its recorded value.
|
||||
-- The build halts with a clear message on mismatch — so the integrity
|
||||
-- guarantee holds even when @tools/archive.py@ does not run first
|
||||
-- (e.g. no @.venv@, or a direct @cabal run site -- build@), and a
|
||||
-- tampered or corrupted artifact can never be deployed.
|
||||
--
|
||||
-- Shells out to @sha256sum@ (GNU coreutils — same toolchain the rest of
|
||||
-- the build assumes); a missing or non-zero @sha256sum@ surfaces as an
|
||||
-- exception that also halts the build.
|
||||
verifyArtifactSha :: String -> FilePath -> String -> IO ()
|
||||
verifyArtifactSha slug path expected = do
|
||||
out <- readProcess "sha256sum" [path] ""
|
||||
let actual = takeWhile (/= ' ') out
|
||||
when (actual /= expected) $ do
|
||||
hPutStrLn stderr $
|
||||
"[archive] FATAL: " ++ slug ++ ": " ++ path
|
||||
++ " SHA-256 mismatch (recorded " ++ expected
|
||||
++ ", found " ++ actual
|
||||
++ "). The committed artifact is corrupt or was replaced; "
|
||||
++ "halting build."
|
||||
exitFailure
|
||||
|
||||
-- | Join the authored manifest with generated provenance. A manifest
|
||||
-- entry with no matching provenance is dropped, so it produces no page;
-- provenance whose artifact is missing on disk halts the build instead.
|
||||
loadArchiveEntries :: IO [ArchiveEntry]
|
||||
loadArchiveEntries = do
|
||||
manifest <- readManifest
|
||||
removed <- readRemovedUrls
|
||||
validateManifestEntries manifest removed
|
||||
provByUrl <- readProvenances
|
||||
fmap catMaybes $ forM manifest $ \me ->
|
||||
case Map.lookup (meUrl me) provByUrl of
|
||||
Nothing -> return Nothing
|
||||
Just (slug, pv) -> do
|
||||
let dir = "archive/" ++ slug
|
||||
txtPath = dir ++ "/" ++ textFileFor pv
|
||||
let artPath = dir ++ "/" ++ pvArtifact pv
|
||||
artifactThere <- doesFileExist artPath
|
||||
if not artifactThere
|
||||
then do
|
||||
hPutStrLn stderr $
|
||||
"[archive] FATAL: " ++ slug ++ ": " ++ artPath
|
||||
++ " is missing although PROVENANCE.json exists; "
|
||||
++ "restore the committed artifact before building."
|
||||
exitFailure
|
||||
else do
|
||||
verifyArtifactSha slug artPath (pvSha256 pv)
|
||||
txtThere <- doesFileExist txtPath
|
||||
txt <- if txtThere then readFileSafe txtPath
|
||||
else return ""
|
||||
return $ Just ArchiveEntry
|
||||
{ aeManifest = me
|
||||
, aeProv = pv
|
||||
, aeFulltext = txt
|
||||
, aeTextId = if txtThere then Just txtPath
|
||||
else Nothing
|
||||
, aeStatus = archiveStatusForSlug slug
|
||||
}
|
||||
|
||||
-- ---------------------------------------------------------------------------
|
||||
-- Rules
|
||||
-- ---------------------------------------------------------------------------
|
||||
|
||||
-- | All archive rules. Called once from 'Site.rules'.
|
||||
archiveRules :: Rules ()
|
||||
archiveRules = do
|
||||
entries <- preprocess loadArchiveEntries
|
||||
|
||||
-- Raw artifacts: the PDF / HTML snapshot of every *public* entry,
|
||||
-- served at its own path (/archive/<slug>/...). Routing this explicit
|
||||
-- list rather than a glob means a `visibility: private` entry's
|
||||
-- artifact is never deployed, and an orphan directory's artifact
|
||||
-- (no manifest line) is not deployed either.
|
||||
let publicArtifacts =
|
||||
[ fromFilePath ("archive/" ++ pvSlug (aeProv e)
|
||||
++ "/" ++ pvArtifact (aeProv e))
|
||||
| e <- entries, not (isPrivate e) ]
|
||||
match (fromList publicArtifacts) $ do
|
||||
route idRoute
|
||||
compile copyFileCompiler
|
||||
|
||||
-- Provenance, extracted text, and the manifest: matched (not routed)
|
||||
-- so the generated pages can `load` them as dependencies and recompile
|
||||
-- when they change.
|
||||
match "archive/*/PROVENANCE.json" $ compile getResourceBody
|
||||
match "archive/*/document.txt" $ compile getResourceBody
|
||||
match "archive/*/snapshot.txt" $ compile getResourceBody
|
||||
match "archive/manifest.yaml" $ compile getResourceBody
|
||||
|
||||
mapM_ archiveEntryRule entries
|
||||
archiveIndexRule entries
|
||||
|
||||
-- | One @/archive/<slug>/@ page.
|
||||
archiveEntryRule :: ArchiveEntry -> Rules ()
|
||||
archiveEntryRule ae =
|
||||
create [fromFilePath ("archive/" ++ slug ++ "/index.html")] $ do
|
||||
route idRoute
|
||||
compile $ do
|
||||
-- Dependency edges: recompile when provenance or the manifest
|
||||
-- changes. The extracted-text sidecar is gitignored and may be
|
||||
-- absent (no .venv / fetch never ran); load it as a dependency
|
||||
-- only when present, so the build never fails for a missing
|
||||
-- generated file.
|
||||
_ <- load provId :: Compiler (Item String)
|
||||
_ <- load manifestId :: Compiler (Item String)
|
||||
case aeTextId ae of
|
||||
Just tp -> do
|
||||
_ <- load (fromFilePath tp) :: Compiler (Item String)
|
||||
return ()
|
||||
Nothing -> return ()
|
||||
makeItem ""
|
||||
>>= loadAndApplyTemplate "templates/archive.html" ctx
|
||||
>>= loadAndApplyTemplate "templates/default.html" ctx
|
||||
>>= relativizeUrls
|
||||
where
|
||||
slug = pvSlug (aeProv ae)
|
||||
provId = fromFilePath ("archive/" ++ slug ++ "/PROVENANCE.json")
|
||||
manifestId = fromFilePath manifestPath
|
||||
ctx = archiveEntryCtx ae
|
||||
|
||||
-- | The @/archive/@ index — every archived work, newest snapshot first.
|
||||
archiveIndexRule :: [ArchiveEntry] -> Rules ()
|
||||
archiveIndexRule entries =
|
||||
create ["archive/index.html"] $ do
|
||||
route idRoute
|
||||
compile $ do
|
||||
-- Recompile when any provenance appears / changes, or the
|
||||
-- manifest changes.
|
||||
_ <- loadAll "archive/*/PROVENANCE.json" :: Compiler [Item String]
|
||||
_ <- load (fromFilePath manifestPath) :: Compiler (Item String)
|
||||
let sorted = sortBy (comparing (Down . pvArchived . aeProv)) entries
|
||||
items = map (\e -> Item (fromFilePath ("archive/" ++ pvSlug (aeProv e))) e)
|
||||
sorted
|
||||
ctx = listField "entries" entryListCtx (return items)
|
||||
<> constField "title" "Archive"
|
||||
<> constField "archive" "true"
|
||||
<> constField "noindex" "true"
|
||||
<> (if null entries then mempty
|
||||
else constField "has-entries" "true")
|
||||
<> siteCtx
|
||||
makeItem ""
|
||||
>>= loadAndApplyTemplate "templates/archive-index.html" ctx
|
||||
>>= loadAndApplyTemplate "templates/default.html" ctx
|
||||
>>= relativizeUrls
|
||||
|
||||
-- ---------------------------------------------------------------------------
|
||||
-- Contexts
|
||||
-- ---------------------------------------------------------------------------
|
||||
|
||||
-- | Per-entry context for the @/archive/<slug>/@ page.
|
||||
archiveEntryCtx :: ArchiveEntry -> Context String
|
||||
archiveEntryCtx ae = mconcat
|
||||
[ constField "title" (pvTitle pv)
|
||||
, constField "archive" "true"
|
||||
, constField "noindex" "true"
|
||||
, constField "original-url" (meUrl me)
|
||||
, constField "archived" (pvArchived pv)
|
||||
, constField "archive-type" (pvType pv)
|
||||
, constField "sha-short" (take 12 (pvSha256 pv))
|
||||
, constField "size" (formatBytes (pvBytes pv))
|
||||
, constField "snapshot-quality" (pvQuality pv)
|
||||
, constField "status" (statusName (aeStatus ae))
|
||||
, qualityFlag
|
||||
, maybeField "status-note" (statusNote (aeStatus ae))
|
||||
, maybeField "note" (meNote me)
|
||||
, maybeField "wayback" (pvWayback pv)
|
||||
, maybeField "paywalled" (if mePaywalled me then Just "true" else Nothing)
|
||||
, visibilityFields
|
||||
-- "Referenced by" (the pages that cite this work) and "Related"
|
||||
-- (semantically near content). Both resolve by this page's route, so
|
||||
-- they need no archive-specific wiring; each is a $if(...)$-guarded
|
||||
-- section in archive.html.
|
||||
, referencedByField
|
||||
, similarLinksField
|
||||
, siteCtx
|
||||
]
|
||||
where
|
||||
me = aeManifest ae
|
||||
pv = aeProv ae
|
||||
slug = pvSlug pv
|
||||
artUrl = "/archive/" ++ slug ++ "/" ++ pvArtifact pv
|
||||
-- A non-'ok' snapshot raises a visible flag on the page.
|
||||
qualityFlag
|
||||
| pvQuality pv == "ok" = mempty
|
||||
| otherwise = constField "degraded" "true"
|
||||
-- A private entry keeps a local preservation copy but publishes none
|
||||
-- of it: no embed, no extracted text — only the provenance metadata
|
||||
-- and a 'held offline' note. A public entry embeds the artifact raw
|
||||
-- (the browser renders the PDF natively, the snapshot loads directly;
|
||||
-- no PDF.js wrapper) and renders its extracted text into the page.
|
||||
-- The is-pdf / is-html flag drives only the iframe sandbox: a
|
||||
-- third-party HTML snapshot is sandboxed, our own committed PDF is not.
|
||||
visibilityFields
|
||||
| isPrivate ae = constField "private" "true"
|
||||
| otherwise = typeField
|
||||
<> constField "artifact-url" artUrl
|
||||
<> constField "artifact-name" (pvArtifact pv)
|
||||
<> fulltextField (pvType pv) (aeFulltext ae)
|
||||
typeField
|
||||
| pvType pv == "html" = constField "is-html" "true"
|
||||
| otherwise = constField "is-pdf" "true"
|
||||
|
||||
-- | Renders the extracted full text into the page DOM so embed.py and
|
||||
-- Pagefind index real text, not an opaque iframe. PDF text keeps its
|
||||
-- pdftotext layout in a @<pre>@; HTML text is block-separated prose, so
|
||||
-- it renders as escaped @<p>@ paragraphs. Absent when the text is empty
|
||||
-- / whitespace, so the @$if(fulltext)$@ guard hides the section.
|
||||
fulltextField :: String -> String -> Context String
|
||||
fulltextField ftype txt
|
||||
| all isBlank txt = mempty
|
||||
| ftype == "html" = constField "fulltext" (htmlParagraphs txt)
|
||||
| otherwise = constField "fulltext" preBlock
|
||||
where
|
||||
isBlank c = c == ' ' || c == '\n' || c == '\t' || c == '\r'
|
||||
preBlock = "<pre class=\"archive-fulltext\">"
|
||||
++ escapeHtml txt ++ "</pre>"
|
||||
|
||||
-- | Block-separated text (paragraphs delimited by blank lines, as
|
||||
-- @archive.py@'s HTML extractor writes it) → escaped @<p>@ elements.
|
||||
htmlParagraphs :: String -> String
|
||||
htmlParagraphs = concatMap para . paragraphsOf
|
||||
where
|
||||
para p = "<p>" ++ escapeHtml p ++ "</p>\n"
|
||||
paragraphsOf = map (unwords . concatMap words)
|
||||
. filter (not . blankGroup)
|
||||
. groupBy ((==) `on` blankLine)
|
||||
. lines
|
||||
blankGroup g = null g || blankLine (head g)
|
||||
blankLine = all (`elem` (" \t\r" :: String))
|
||||
|
||||
-- | List-item context for the @/archive/@ index.
|
||||
entryListCtx :: Context ArchiveEntry
|
||||
entryListCtx = mconcat
|
||||
[ field "entry-title" (return . pvTitle . aeProv . itemBody)
|
||||
, field "entry-archived" (return . pvArchived . aeProv . itemBody)
|
||||
, field "entry-type" (return . pvType . aeProv . itemBody)
|
||||
, field "entry-quality" (return . pvQuality . aeProv . itemBody)
|
||||
, boolField "entry-degraded" ((/= "ok") . pvQuality . aeProv . itemBody)
|
||||
, boolField "entry-private" (isPrivate . itemBody)
|
||||
, field "entry-status" (return . statusName . aeStatus . itemBody)
|
||||
, boolField "entry-rotted" ((== Rotted) . aeStatus . itemBody)
|
||||
, field "entry-url" (\i -> return $
|
||||
"/archive/" ++ pvSlug (aeProv (itemBody i)) ++ "/")
|
||||
]
|
||||
|
||||
-- | Provide a field only when the value is present; otherwise contribute
|
||||
-- nothing, so the template's @$if(...)$@ guard is false.
|
||||
maybeField :: String -> Maybe String -> Context String
|
||||
maybeField k = maybe mempty (constField k)
|
||||
|
||||
-- | A prose note for a non-live link-rot status, shown on the archive
|
||||
-- page; 'Nothing' for 'Live' / 'Error' (no note rendered).
|
||||
statusNote :: ArchiveStatus -> Maybe String
|
||||
statusNote Rotted = Just "The original is no longer reachable. This archived \
|
||||
\copy is now the live link."
|
||||
statusNote Moved = Just "The original page has moved since this snapshot was \
|
||||
\taken; the link above may redirect."
|
||||
statusNote _ = Nothing
|
||||
|
||||
-- ---------------------------------------------------------------------------
|
||||
-- Formatting
|
||||
-- ---------------------------------------------------------------------------
|
||||
|
||||
-- | Human-readable byte count (mirrors the helper in build/Stats.hs).
|
||||
formatBytes :: Integer -> String
|
||||
formatBytes b
|
||||
| b < 1024 = show b ++ " B"
|
||||
| b < 1024 * 1024 = showD (b * 10 `div` 1024) ++ " KB"
|
||||
| otherwise = showD (b * 10 `div` (1024 * 1024)) ++ " MB"
|
||||
where
|
||||
showD n = show (n `div` 10) ++ "." ++ show (n `mod` 10)
|
||||
|
||||
-- ---------------------------------------------------------------------------
|
||||
-- /build/ telemetry
|
||||
-- ---------------------------------------------------------------------------
|
||||
|
||||
-- | Archive metrics for the @/build/@ telemetry page — count, total size,
|
||||
-- median artifact age, breakdowns by link-rot status / snapshot quality
|
||||
-- / visibility, the paywalled count, and any orphan directories.
|
||||
-- Rendered by @Stats.hs@; an empty archive yields just the count.
|
||||
archiveBuildStats :: IO [(String, String)]
|
||||
archiveBuildStats = do
|
||||
entries <- loadArchiveEntries
|
||||
today <- utctDay <$> getCurrentTime
|
||||
orphans <- findOrphanDirs entries
|
||||
let n = length entries
|
||||
bytes = sum (map (pvBytes . aeProv) entries)
|
||||
ages = [ fromInteger (diffDays today d)
|
||||
| e <- entries
|
||||
, Just d <- [parseIsoDay (pvArchived (aeProv e))] ]
|
||||
paywalled = length (filter (mePaywalled . aeManifest) entries)
|
||||
return $
|
||||
[ ("Entries", show n) ]
|
||||
++ (if n == 0 then [] else
|
||||
[ ("Total size", formatBytes bytes)
|
||||
, ("Median age", medianAge ages)
|
||||
, ("By status", tallyOf (map (statusName . aeStatus) entries))
|
||||
, ("By quality", tallyOf (map (pvQuality . aeProv) entries))
|
||||
, ("By visibility", tallyOf (map (meVisibility . aeManifest) entries))
|
||||
])
|
||||
++ [ ("Paywalled", show paywalled) | paywalled > 0 ]
|
||||
++ [ ("Orphan directories", unwords orphans) | not (null orphans) ]
|
||||
|
||||
-- | Directory names under @archive/@ that hold a @PROVENANCE.json@ but are
-- not a live manifest entry: drift the @/build/@ page should surface.
findOrphanDirs :: [ArchiveEntry] -> IO [String]
findOrphanDirs entries = do
    exists <- doesDirectoryExist "archive"
    if not exists
        then return []
        else do
            names <- listDirectory "archive"
            let live = map (pvSlug . aeProv) entries
            filterM
                (\name -> do
                    hasProv <- doesFileExist
                        ("archive/" ++ name ++ "/PROVENANCE.json")
                    return (hasProv && name `notElem` live))
                (sort names)

-- | Format a multiset of string values as @"a 2 \183 b 1"@.
tallyOf :: [String] -> String
tallyOf xs = intercalate " \183 "
    [ k ++ " " ++ show c
    | (k, c) <- Map.toList (Map.fromListWith (+) [ (x, 1 :: Int) | x <- xs ]) ]

-- | The median of a list of ages, as @"N days"@; an em dash when empty.
-- For an even-length list this is the upper of the two middle values.
medianAge :: [Int] -> String
medianAge [] = "\8212"
medianAge xs =
    let m = sort xs !! (length xs `div` 2)
    in  show m ++ if m == 1 then " day" else " days"

-- | Parse a @YYYY-MM-DD@ date; 'Nothing' on malformed input.
parseIsoDay :: String -> Maybe Day
parseIsoDay s = case splitOnDash s of
    [y, m, d] -> fromGregorian <$> readMaybe y <*> readMaybe m <*> readMaybe d
    _         -> Nothing
  where
    splitOnDash str = case break (== '-') str of
        (a, '-' : rest) -> a : splitOnDash rest
        (a, _)          -> [a]

@ -0,0 +1,255 @@
{-# LANGUAGE GHC2021 #-}
{-# LANGUAGE OverloadedStrings #-}
-- | ArchiveIndex. Shared read-only access to the archive's two JSON
--   sidecars: @data/archive-index.json@ (the @url\/alias -> slug@ map
--   written by @archive.py fetch@) and @data/archive-state.json@ (the
--   per-URL link-rot status written by @archive.py check@).
--
--   Consumers:
--
--     * @Filters.Archive@: appends the archive affordance to body links
--       whose target is archived, and flips a @rotted@ link to the local
--       copy.
--     * @Backlinks@: keeps archived external links through pass 1 and
--       canonicalises them to their @/archive/<slug>/@ page in pass 2.
--     * @Archive@: surfaces each entry's rot status on its page, the
--       @/archive/@ index, and the @/build/@ telemetry.
--
--   Both files are loaded once per build via @unsafePerformIO@ CAFs. An
--   absent or malformed file degrades safely: an empty index makes the
--   link consumers no-op; an absent state file makes every entry @Live@
--   (the safe default: no link flip). @archive.py check@ is decoupled
--   from @make build@; a build consumes whatever state file exists.
module ArchiveIndex
    ( ArchiveStatus (..)
    , statusName
    , archiveSlugFor
    , archiveStatusForSlug
    , archiveIndexIsEmpty
    , normalizeUrl
    ) where

import           Data.Map.Strict  (Map)
import qualified Data.Map.Strict  as Map
import           Data.Maybe       (fromMaybe)
import           Data.Set         (Set)
import qualified Data.Set         as Set
import           Data.Text        (Text)
import qualified Data.Text        as T
import qualified Data.Aeson       as A
import           Data.Aeson       ((.!=), (.:), (.:?))
import qualified Data.Yaml        as Y
import           System.Directory (doesFileExist)
import           System.IO.Unsafe (unsafePerformIO)

-- ---------------------------------------------------------------------------
-- Link-rot status
-- ---------------------------------------------------------------------------

-- | The link-rot status of an archived work's original URL, as set by
-- @archive.py check@. 'Live' is the safe default for an unscanned or
-- unknown entry.
data ArchiveStatus = Live | Moved | Rotted | Error
    deriving (Eq, Show)

-- | The lower-case wire name, matching @archive-state.json@ and the
-- @status:@ Pagefind filter tag.
statusName :: ArchiveStatus -> String
statusName Live   = "live"
statusName Moved  = "moved"
statusName Rotted = "rotted"
statusName Error  = "error"

parseStatus :: Text -> ArchiveStatus
parseStatus "moved"  = Moved
parseStatus "rotted" = Rotted
parseStatus "error"  = Error
parseStatus _        = Live

-- ---------------------------------------------------------------------------
-- JSON shapes
-- ---------------------------------------------------------------------------

-- | One @archive-index.json@ entry. Only @slug@ and @aliases@ are used.
data IdxEntry = IdxEntry
    { ieSlug    :: String
    , ieAliases :: [Text]
    }

instance A.FromJSON IdxEntry where
    parseJSON = A.withObject "IdxEntry" $ \o -> IdxEntry
        <$> o .: "slug"
        <*> (o .:? "aliases" .!= [])

-- | One @archive-state.json@ entry; only the @status@ is consumed here.
newtype StateEntry = StateEntry { seStatus :: ArchiveStatus }

instance A.FromJSON StateEntry where
    parseJSON = A.withObject "StateEntry" $ \o ->
        StateEntry . parseStatus <$> (o .:? "status" .!= "live")

-- | One @manifest.yaml@ or @removed.yaml@ entry; only the @url@ is
-- consumed here.
newtype UrlEntry = UrlEntry { ueUrl :: Text }

instance A.FromJSON UrlEntry where
    parseJSON = A.withObject "UrlEntry" $ \o ->
        UrlEntry <$> o .: "url"

-- ---------------------------------------------------------------------------
-- Loaded-once CAFs
-- ---------------------------------------------------------------------------

indexPath, statePath, manifestPath, removedPath :: FilePath
indexPath    = "data/archive-index.json"
statePath    = "data/archive-state.json"
manifestPath = "archive/manifest.yaml"
removedPath  = "archive/removed.yaml"

-- | The normalised @url@ of every entry in a YAML list file. An absent
-- file yields the empty set; a malformed one aborts the build.
readUrlSet :: FilePath -> IO (Set Text)
readUrlSet path = do
    exists <- doesFileExist path
    if not exists
        then return Set.empty
        else do
            decoded <- Y.decodeFileEither path
            case decoded of
                Right entries -> return . Set.fromList $
                    map (normalizeUrl . ueUrl) (entries :: [UrlEntry])
                Left e -> ioError . userError $
                    "[archive] FATAL: " ++ path ++ ": " ++ show e

-- | Canonical URLs still permitted to participate in link annotation.
-- Filtering the generated index at build time makes a direct Hakyll build
-- respect authored manifest/removal state even when archive.py did not run.
{-# NOINLINE activeUrls #-}
activeUrls :: Set Text
activeUrls = unsafePerformIO $ do
    manifest <- readUrlSet manifestPath
    removed  <- readUrlSet removedPath
    return (manifest `Set.difference` removed)

-- | @canonical-url -> entry@. Absent/malformed file -> empty; entries no
-- longer permitted by the authored manifest/removal state are removed.
{-# NOINLINE rawIndex #-}
rawIndex :: Map Text IdxEntry
rawIndex = unsafePerformIO $ do
    -- @A.eitherDecodeFileStrict'@ reports only parse failures as 'Left'; a
    -- missing file raises an IOException, so check for existence first.
    exists  <- doesFileExist indexPath
    decoded <- if exists
        then A.eitherDecodeFileStrict' indexPath
        else return (Left "absent")
    let parsed = either (const Map.empty) id decoded
    return $ Map.filterWithKey
        (\canon _ -> normalizeUrl canon `Set.member` activeUrls)
        parsed

-- | @url -> status@. Absent/malformed file -> empty (every entry 'Live').
{-# NOINLINE rawState #-}
rawState :: Map Text ArchiveStatus
rawState = unsafePerformIO $ do
    -- Same existence guard as 'rawIndex': an absent state file is not an error.
    exists  <- doesFileExist statePath
    decoded <- if exists
        then A.eitherDecodeFileStrict' statePath
        else return (Left "absent")
    return $ either (const Map.empty) (Map.map seStatus) decoded

-- | @normalised-url -> slug@: the canonical key and every alias from
-- @archive-index.json@, each fed through 'normalizeUrl'. Both keys and
-- lookups are normalised, so a citation form the alias set cannot
-- enumerate (e.g. an unbounded arXiv version, or any tracking-laden
-- variant of a clean manifest URL) still resolves.
{-# NOINLINE flatIndex #-}
flatIndex :: Map Text String
flatIndex = Map.fromList
    [ (normalizeUrl key, ieSlug e)
    | (canon, e) <- Map.toList rawIndex
    , key <- canon : ieAliases e
    ]

-- | @slug -> status@: each entry's status, looked up by its canonical URL
-- in the state file (the two files share the manifest URL as key).
{-# NOINLINE slugStatus #-}
slugStatus :: Map String ArchiveStatus
slugStatus = Map.fromList
    [ (ieSlug e, Map.findWithDefault Live canon rawState)
    | (canon, e) <- Map.toList rawIndex
    ]

-- ---------------------------------------------------------------------------
-- Public lookups
-- ---------------------------------------------------------------------------

-- | True when no archive index is available; the link consumers no-op.
archiveIndexIsEmpty :: Bool
archiveIndexIsEmpty = Map.null rawIndex

-- | The archive slug for an outbound URL, or 'Nothing'. Both the index
-- keys and the input go through 'normalizeUrl', so a citation form that
-- the alias set cannot enumerate (an unbounded arXiv version, or any
-- tracking-laden variant of a clean manifest URL) still resolves.
archiveSlugFor :: Text -> Maybe String
archiveSlugFor url = Map.lookup (normalizeUrl url) flatIndex

-- | The link-rot status of an archived entry, by slug. 'Live' for an
-- unknown slug or when no scan has run.
archiveStatusForSlug :: String -> ArchiveStatus
archiveStatusForSlug slug = Map.findWithDefault Live slug slugStatus

-- ---------------------------------------------------------------------------
-- URL normalisation (matching, not display)
-- ---------------------------------------------------------------------------

-- | Tracking-only query parameters: their presence or absence is
-- semantically irrelevant; the lookup strips them before matching.
-- Sync with @TRACKING_PARAMS@ in @tools/archive.py@.
trackingParams :: [Text]
trackingParams =
    [ "utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"
    , "fbclid", "gclid", "mc_eid", "mc_cid", "ref", "igshid"
    , "_hsenc", "_hsmi", "mkt_tok"
    ]

-- | Remove tracking-only query parameters; preserve every other parameter
-- in its original order.
stripTracking :: Text -> Text
stripTracking url = case T.breakOn "?" url of
    (_, "")   -> url
    (path, q) ->
        let kept = filter notTracking (T.splitOn "&" (T.drop 1 q))
        in  if null kept then path
            else path <> "?" <> T.intercalate "&" kept
  where
    notTracking p = T.takeWhile (/= '=') p `notElem` trackingParams

-- | The canonical form of an arXiv URL: @https://arxiv.org/abs/<id>@ with
-- no version suffix and no @.pdf@. Maps every member of the
-- abs/pdf/versioned/@.pdf@ family to the same key. Non-arXiv passes through.
arxivCanonical :: Text -> Text
arxivCanonical url
    | Just rest <- T.stripPrefix "https://arxiv.org/" url
    , Just key  <- arxivKey rest = key
    | Just rest <- T.stripPrefix "http://arxiv.org/" url
    , Just key  <- arxivKey rest = key
    | otherwise = url
  where
    arxivKey rest = case T.breakOn "/" rest of
        (kind, slashId)
            | kind `elem` ["abs", "pdf"], not (T.null slashId) ->
                Just $ "https://arxiv.org/abs/"
                    <> stripVer (stripPdfSuf (T.tail slashId))
        _ -> Nothing
    stripPdfSuf t = fromMaybe t (T.stripSuffix ".pdf" t)
    stripVer t = case T.breakOnEnd "v" t of
        (before, ver)
            | not (T.null before)
            , not (T.null ver)
            , T.all isAsciiDigit ver
            -> T.dropEnd 1 before
        _ -> t
    isAsciiDigit c = c >= '0' && c <= '9'

-- | The full normalisation: drop fragment, strip tracking, fold
-- @http://@→@https://@, arXiv-canonicalise, trim a trailing slash. Both
-- 'flatIndex' keys and 'archiveSlugFor' inputs go through this so the
-- index never misses a citation form the design promises to match.
normalizeUrl :: Text -> Text
normalizeUrl url =
    let noFrag = T.takeWhile (/= '#') url
        clean  = stripTracking noFrag
        https  = case T.stripPrefix "http://" clean of
            Just rest -> "https://" <> rest
            Nothing   -> clean
        arxiv  = arxivCanonical https
    in  T.dropWhileEnd (== '/') arxiv

@ -25,9 +25,11 @@
|
|||
module Backlinks
|
||||
( backlinkRules
|
||||
, backlinksField
|
||||
, referencedByField
|
||||
) where
|
||||
|
||||
import Data.List (nubBy, sortBy)
|
||||
import Data.List (nubBy, partition, sortBy,
|
||||
stripPrefix)
|
||||
import Data.Ord (comparing)
|
||||
import Data.Maybe (fromMaybe)
|
||||
import qualified Data.Map.Strict as Map
|
||||
|
|
@ -50,6 +52,7 @@ import Hakyll
|
|||
import Compilers (readerOpts, writerOpts)
|
||||
import Filters (preprocessSource)
|
||||
import qualified Patterns as P
|
||||
import ArchiveIndex (archiveSlugFor)
|
||||
|
||||
-- ---------------------------------------------------------------------------
|
||||
-- Link-with-context entry (intermediate, saved by the "links" pass)
|
||||
|
|
@ -85,6 +88,7 @@ data BacklinkSource = BacklinkSource
|
|||
, blAbstract :: String
|
||||
, blSentence :: String -- raw HTML of the sentence containing the link
|
||||
, blParagraph :: String -- raw HTML of the full paragraph (hover popup)
|
||||
, blFragment :: String -- archived-target fragment (no '#'), else ""
|
||||
} deriving (Show, Eq, Ord)
|
||||
|
||||
instance Aeson.ToJSON BacklinkSource where
|
||||
|
|
@ -94,6 +98,7 @@ instance Aeson.ToJSON BacklinkSource where
|
|||
, "abstract" .= blAbstract bl
|
||||
, "sentence" .= blSentence bl
|
||||
, "paragraph" .= blParagraph bl
|
||||
, "fragment" .= blFragment bl
|
||||
]
|
||||
|
||||
instance Aeson.FromJSON BacklinkSource where
|
||||
|
|
@ -104,6 +109,7 @@ instance Aeson.FromJSON BacklinkSource where
|
|||
<*> o Aeson..: "abstract"
|
||||
<*> o Aeson..: "sentence"
|
||||
<*> o Aeson..: "paragraph"
|
||||
<*> o Aeson..:? "fragment" Aeson..!= ""
|
||||
|
||||
-- ---------------------------------------------------------------------------
|
||||
-- Writer options for context rendering
|
||||
|
|
@ -125,7 +131,11 @@ contextWriterOpts = writerOpts
|
|||
-- | URL filter: skip external links, pseudo-schemes, anchor-only fragments,
|
||||
-- and static-asset paths.
|
||||
isPageLink :: T.Text -> Bool
|
||||
isPageLink u =
|
||||
isPageLink u
|
||||
-- An archived external URL is kept regardless of scheme or extension:
|
||||
-- pass 2 inverts it to its /archive/<slug>/ page.
|
||||
| isArchived = True
|
||||
| otherwise =
|
||||
not (T.isPrefixOf "http://" u) &&
|
||||
not (T.isPrefixOf "https://" u) &&
|
||||
not (T.isPrefixOf "#" u) &&
|
||||
|
|
@ -134,6 +144,9 @@ isPageLink u =
|
|||
not (T.null u) &&
|
||||
not (hasStaticExt u)
|
||||
where
|
||||
isArchived = case archiveSlugFor u of
|
||||
Just _ -> True
|
||||
Nothing -> False
|
||||
staticExts = [".pdf",".svg",".png",".jpg",".jpeg",".webp",
|
||||
".mp3",".mp4",".woff2",".woff",".ttf",".ico",
|
||||
".json",".asc",".xml",".gz",".zip"]
|
||||
|
|
@ -289,6 +302,28 @@ percentDecode = T.unpack . TE.decodeUtf8With lenientDecode . pack . go
|
|||
pack = BS.pack
|
||||
lenientDecode = TE.lenientDecode
|
||||
|
||||
-- ---------------------------------------------------------------------------
|
||||
-- Archive-aware target keying
|
||||
-- ---------------------------------------------------------------------------
|
||||
|
||||
-- | The @data/backlinks.json@ key an outbound URL inverts to. An archived
|
||||
-- external URL canonicalises to its @/archive/<slug>/@ page key — computed
|
||||
-- exactly as 'backlinksFieldWith' computes the archive page's own key (the
|
||||
-- same string fed through 'normaliseUrl'), so the two always agree. Every
|
||||
-- other URL is normalised as before.
|
||||
targetKey :: T.Text -> T.Text
|
||||
targetKey u = case archiveSlugFor u of
|
||||
Just slug -> T.pack (normaliseUrl ("/archive/" ++ slug ++ "/index.html"))
|
||||
Nothing -> T.pack (normaliseUrl (T.unpack u))
|
||||
|
||||
-- | The fragment (without @#@) of an archived URL, for granular grouping
|
||||
-- of "Referenced by". Empty for a non-archived URL or one with no fragment
|
||||
-- — so granular grouping stays an archive-only behaviour.
|
||||
archiveFragment :: T.Text -> String
|
||||
archiveFragment u = case archiveSlugFor u of
|
||||
Just _ -> T.unpack (T.drop 1 (T.dropWhile (/= '#') u))
|
||||
Nothing -> ""
|
||||
|
||||
-- ---------------------------------------------------------------------------
|
||||
-- Content patterns (must match the rules in Site.hs — sourced from
|
||||
-- Patterns.allContent so additions to the canonical list automatically
|
||||
|
|
@ -337,10 +372,11 @@ toSourcePairs item = do
|
|||
:: Maybe [LinkEntry] of
|
||||
Nothing -> return []
|
||||
Just entries ->
|
||||
return [ ( T.pack (normaliseUrl (T.unpack (leUrl e)))
|
||||
return [ ( targetKey (leUrl e)
|
||||
, BacklinkSource srcUrl title abstract
|
||||
(leSentence e)
|
||||
(leParagraph e)
|
||||
(archiveFragment (leUrl e))
|
||||
)
|
||||
| e <- entries ]
|
||||
|
||||
|
|
@ -352,7 +388,20 @@ toSourcePairs item = do
|
|||
-- to the current page, each with its paragraph context.
|
||||
-- Returns @noResult@ (so @$if(backlinks)$@ is false) when there are none.
|
||||
backlinksField :: Context String
|
||||
backlinksField = field "backlinks" $ \item -> do
|
||||
backlinksField = backlinksFieldWith renderBacklinks "backlinks"
|
||||
|
||||
-- | "Referenced by" for archive pages. Same lookup as 'backlinksField',
|
||||
-- but the sources are grouped by the fragment each citation targets, so an
|
||||
-- archived work's page can show which section/page each citing essay points
|
||||
-- at (granular backlinks).
|
||||
referencedByField :: Context String
|
||||
referencedByField = backlinksFieldWith renderReferencedBy "referenced-by"
|
||||
|
||||
-- | Shared machinery for 'backlinksField' and 'referencedByField': look the
|
||||
-- page up in @data/backlinks.json@ by its normalised route, then hand the
|
||||
-- sorted sources to the given renderer.
|
||||
backlinksFieldWith :: ([BacklinkSource] -> String) -> String -> Context String
|
||||
backlinksFieldWith renderSources name = field name $ \item -> do
|
||||
blItem <- load (fromFilePath "data/backlinks.json") :: Compiler (Item String)
|
||||
case Aeson.decodeStrict (TE.encodeUtf8 (T.pack (itemBody blItem)))
|
||||
:: Maybe (Map T.Text [BacklinkSource]) of
|
||||
|
|
@ -367,7 +416,7 @@ backlinksField = field "backlinks" $ \item -> do
|
|||
sorted = sortBy (comparing blTitle) sources
|
||||
in if null sorted
|
||||
then fail "no backlinks"
|
||||
else return (renderBacklinks sorted)
|
||||
else return (renderSources sorted)
|
||||
|
||||
-- ---------------------------------------------------------------------------
|
||||
-- HTML rendering
|
||||
|
|
@ -384,10 +433,44 @@ backlinksField = field "backlinks" $ \item -> do
|
|||
renderBacklinks :: [BacklinkSource] -> String
|
||||
renderBacklinks sources =
|
||||
"<ul class=\"backlinks-list\">\n"
|
||||
++ concatMap renderOne sources
|
||||
++ concatMap renderBacklinkItem sources
|
||||
++ "</ul>"
|
||||
|
||||
-- | "Referenced by", grouped by the fragment each citation targets.
|
||||
-- Sources citing the work with no fragment render first as a plain list;
|
||||
-- each distinct fragment then gets its own subheading. With no fragments
|
||||
-- anywhere (the common case) this collapses to exactly the flat list.
|
||||
renderReferencedBy :: [BacklinkSource] -> String
|
||||
renderReferencedBy sources =
|
||||
let (general, fragmented) = partition (null . blFragment) sources
|
||||
groups = Map.toList $ Map.fromListWith (flip (++))
|
||||
[ (blFragment s, [s]) | s <- fragmented ]
|
||||
in renderList general ++ concatMap renderGroup groups
|
||||
where
|
||||
renderOne bl =
|
||||
renderList [] = ""
|
||||
renderList ss = "<ul class=\"backlinks-list\">\n"
|
||||
++ concatMap renderBacklinkItem ss ++ "</ul>\n"
|
||||
renderGroup (frag, ss) =
|
||||
"<div class=\"referenced-by-group\">"
|
||||
++ "<h3 class=\"referenced-by-fragment\">"
|
||||
++ escapeHtml (fragmentLabel frag) ++ "</h3>"
|
||||
++ renderList ss
|
||||
++ "</div>\n"
|
||||
|
||||
-- | Human label for a cited fragment: a PDF @#page=N@ becomes "Page N";
|
||||
-- any other @#anchor@ is shown verbatim behind a section mark.
|
||||
fragmentLabel :: String -> String
|
||||
fragmentLabel frag =
|
||||
case stripPrefix "page=" frag of
|
||||
Just n -> "Page " ++ n
|
||||
Nothing -> "\x00A7 " ++ frag
|
||||
|
||||
-- | One backlink @<li>@: the source title as a link, the sentence of
|
||||
-- context as a blockquote, and a hover affordance revealing the full
|
||||
-- paragraph. 'blSentence' / 'blParagraph' are already HTML fragments from
|
||||
-- the Pandoc writer, so they are emitted unescaped.
|
||||
renderBacklinkItem :: BacklinkSource -> String
|
||||
renderBacklinkItem bl =
|
||||
"<li class=\"backlink-item\">"
|
||||
++ "<a class=\"backlink-source\" href=\""
|
||||
++ escapeHtml (blUrl bl) ++ "\">"
|
||||
|
|
@ -395,11 +478,11 @@ renderBacklinks sources =
|
|||
++ ( if null (blSentence bl) then ""
|
||||
else "<blockquote class=\"backlink-quote\">"
|
||||
++ blSentence bl
|
||||
++ paragraphAffordance bl
|
||||
++ paragraphAffordance
|
||||
++ "</blockquote>" )
|
||||
++ "</li>\n"
|
||||
|
||||
paragraphAffordance bl
|
||||
where
|
||||
paragraphAffordance
|
||||
| null (blParagraph bl) = ""
|
||||
| blParagraph bl == blSentence bl = ""
|
||||
| otherwise =
|
||||
|
|
|
|||
|
|
@ -13,6 +13,7 @@ import qualified Filters.Typography as Typography
|
|||
import qualified Filters.Links as Links
|
||||
import qualified Filters.SourceRefs as SourceRefs
|
||||
import qualified Filters.Smallcaps as Smallcaps
|
||||
import qualified Filters.Archive as Archive
|
||||
import qualified Filters.Dropcaps as Dropcaps
|
||||
import qualified Filters.Math as Math
|
||||
import qualified Filters.Wikilinks as Wikilinks
|
||||
|
|
@ -40,6 +41,7 @@ applyAll srcDir doc = do
|
|||
. Sidenotes.apply
|
||||
. Typography.apply
|
||||
. Links.apply
|
||||
. Archive.apply
|
||||
. Smallcaps.apply
|
||||
. Dropcaps.apply
|
||||
. Math.apply
|
||||
|
|
|
|||
|
|
@ -0,0 +1,82 @@
{-# LANGUAGE GHC2021 #-}
{-# LANGUAGE OverloadedStrings #-}
-- | Filters.Archive. Annotate (and, for dead links, redirect) body links
--   to archived works.
--
--   For every @Link@ whose URL matches an entry in @data/archive-index.json@
--   (the equivalent-URL alias set included):
--
--     * a 'live', 'moved' or (inconclusive) 'error' target keeps its
--       original link and gains a small superscript affordance pointing at
--       the local @/archive/<slug>/@ page (purely additive);
--
--     * a 'rotted' target (confirmed dead by @archive.py check@'s
--       hysteresis) has its primary link flipped to the archived copy, so
--       a reader of an old essay reaches a working snapshot instead of a
--       404. An "archived" marker replaces the affordance.
--
--   Registered in 'Filters.applyAll' immediately after @Smallcaps@ and
--   before @Links@: it must see the smallcaps-rewritten text, and it emits
--   the affordance/marker as @RawInline@ so the downstream @Links@ pass
--   never re-classifies it.
--
--   No-op when @data/archive-index.json@ is absent. When no rot scan has
--   run, every entry is 'Live', so no link is ever flipped.
module Filters.Archive (apply) where

import qualified Data.Text              as T
import           Text.Pandoc.Definition
import           Text.Pandoc.Walk       (walk)
import           ArchiveIndex           (ArchiveStatus (..), archiveIndexIsEmpty,
                                         archiveSlugFor, archiveStatusForSlug)

-- | Annotate body links. Top-level headings are left alone; an affordance
-- there would be noise. Identity when the index is empty.
apply :: Pandoc -> Pandoc
apply doc@(Pandoc meta blocks)
    | archiveIndexIsEmpty = doc
    | otherwise           = Pandoc meta (map annotateBlock blocks)

annotateBlock :: Block -> Block
annotateBlock h@Header{} = h
annotateBlock b          = walk annotateInlines b

-- | For each archived @Link@: flip it if the target is 'Rotted', else
-- append the affordance. Non-archived links pass through untouched.
annotateInlines :: [Inline] -> [Inline]
annotateInlines = concatMap expand
  where
    expand l@(Link attr text (url, _)) =
        case archiveSlugFor url of
            Nothing   -> [l]
            Just slug -> case archiveStatusForSlug slug of
                Rotted -> [flipped slug attr text, marker slug "rotted"
                            "The original is a dead link \8212 \
                            \opens the local archived copy"]
                _      -> [l, marker slug "" "Archived \8212 \
                            \local preservation copy"]
    expand x = [x]

-- | A 'Rotted' link, redirected to the local archived copy. Keeps the
-- link text; the @archive-rotted@ class lets CSS mark it.
flipped :: String -> Attr -> [Inline] -> Inline
flipped slug (ident, classes, kvs) text =
    Link (ident, "archive-rotted" : classes, kvs) text
        ( T.pack ("/archive/" ++ slug ++ "/")
        , "Original link is dead \8212 opens the local archived copy" )

-- | The superscript marker after the link: "A" for a normal affordance,
-- "archived" for a flipped dead link. Emitted as raw HTML so the
-- downstream @Links@ filter (which classifies @Link@ nodes) leaves it
-- alone. Slugs are @[a-z0-9-]@ by construction in @archive.py@.
marker :: String -> String -> T.Text -> Inline
marker slug modifier title = RawInline "html" $ T.concat
    [ "<sup class=\"archive-affordance", modifierClass, "\">"
    , "<a href=\"/archive/", T.pack slug, "/\" title=\"", title, "\">"
    , label, "</a></sup>"
    ]
  where
    modifierClass = if null modifier
        then ""
        else " archive-affordance--" <> T.pack modifier
    label = if null modifier then "A" else "archived"

@ -1,7 +1,23 @@
|
|||
module Main where
|
||||
|
||||
import Data.Time.Clock.POSIX (getPOSIXTime)
|
||||
import System.Directory (createDirectoryIfMissing)
|
||||
import Hakyll (hakyll)
|
||||
import Site (rules)
|
||||
|
||||
-- | Stamp the start of this build into @data/build-stamp.txt@ before
|
||||
-- Hakyll scans the provider directory. The file therefore always exists
|
||||
-- and always differs from the previous run. The telemetry pages
|
||||
-- (@/build/@, @/stats/@) @load@ it as a dependency so Hakyll recompiles
|
||||
-- them on every build instead of serving a stale cached copy when no
|
||||
-- tracked content changed. See build/Stats.hs and build/Site.hs.
|
||||
writeBuildStamp :: IO ()
|
||||
writeBuildStamp = do
|
||||
createDirectoryIfMissing True "data"
|
||||
t <- getPOSIXTime
|
||||
writeFile "data/build-stamp.txt" (show t ++ "\n")
|
||||
|
||||
main :: IO ()
|
||||
main = hakyll rules
|
||||
main = do
|
||||
writeBuildStamp
|
||||
hakyll rules
|
||||
|
|
|
|||
|
|
@ -19,6 +19,7 @@ import qualified Data.Aeson as Aeson
|
|||
import qualified Data.ByteString.Lazy.Char8 as LBS
|
||||
import qualified Data.Map.Strict as Map
|
||||
import Hakyll
|
||||
import Archive (archiveRules)
|
||||
import Authors (buildAllAuthors, applyAuthorRules)
|
||||
import Backlinks (backlinkRules)
|
||||
import BibExtras (BibExtra (..), emptyBibExtra, firstAuthorSurname, parseBibExtras)
|
||||
|
|
@ -265,6 +266,13 @@ rules = do
|
|||
-- /current.html. Re-compiles current.html when the YAML changes.
|
||||
match "data/now.yaml" $ compile getResourceBody
|
||||
|
||||
-- Per-build stamp — written by Main.main before Hakyll starts, so it
|
||||
-- always exists and always differs from the previous run. Matched
|
||||
-- (not routed) purely so the telemetry pages can `load` it as a
|
||||
-- dependency and thus recompile every build instead of serving a
|
||||
-- stale cached copy. See build/Stats.hs.
|
||||
match "data/build-stamp.txt" $ compile getResourceBody
|
||||
|
||||
-- ---------------------------------------------------------------------------
|
||||
-- Homepage
|
||||
-- ---------------------------------------------------------------------------
|
||||
|
|
@ -529,6 +537,13 @@ rules = do
|
|||
-- ---------------------------------------------------------------------------
|
||||
photographyRules
|
||||
|
||||
-- ---------------------------------------------------------------------------
|
||||
-- Archive — link-archiving system: per-entry /archive/<slug>/ pages and
|
||||
-- the /archive/ index, driven by archive/manifest.yaml + PROVENANCE.json.
|
||||
-- See build/Archive.hs and ARCHIVE.md for the design.
|
||||
-- ---------------------------------------------------------------------------
|
||||
archiveRules
|
||||
|
||||
-- ---------------------------------------------------------------------------
|
||||
-- Blog index (paginated)
|
||||
-- ---------------------------------------------------------------------------
|
||||
|
|
@ -926,6 +941,13 @@ rules = do
|
|||
create ["robots.txt"] $ do
|
||||
route idRoute
|
||||
compile $ makeItem $ unlines
|
||||
-- /archive/ is *deliberately not* disallowed. Crawlers must be
|
||||
-- able to reach the wrapper pages (and snapshot.html) to see
|
||||
-- their <meta name=robots content="noindex, noarchive">; a
|
||||
-- robots.txt Disallow would block that and a URL blocked only
|
||||
-- by robots.txt can still appear in results when linked. The
|
||||
-- raw PDFs cannot carry meta — they need an `X-Robots-Tag`
|
||||
-- HTTP header from the deploy webserver (see nginx/archive.conf).
|
||||
[ "User-agent: *"
|
||||
, "Allow: /"
|
||||
, ""
|
||||
|
|
|
|||
|
|
@ -37,6 +37,7 @@ import qualified Text.Blaze.Html5.Attributes as A
|
|||
import Text.Blaze.Html.Renderer.String (renderHtml)
|
||||
import qualified Text.Blaze.Internal as BI
|
||||
import Hakyll
|
||||
import Archive (archiveBuildStats)
|
||||
import Contexts (siteCtx, authorLinksField)
|
||||
import qualified Patterns as P
|
||||
import Utils (readingTime)
|
||||
|
|
@ -707,6 +708,14 @@ renderBuild ts dur =
|
|||
, ("Last build duration", txt dur)
|
||||
]
|
||||
|
||||
-- | Link-archive coverage and health. The metric rows are computed by
|
||||
-- 'Archive.archiveBuildStats' (count, size, link-rot status breakdown,
|
||||
-- snapshot quality, visibility, orphans); this only lays them out.
|
||||
renderArchive :: [(String, String)] -> H.Html
|
||||
renderArchive metrics =
|
||||
section "archive" "Link archive" $
|
||||
dl [ (k, txt v) | (k, v) <- metrics ]
|
||||
|
||||
-- ---------------------------------------------------------------------------
|
||||
-- Static TOC (matches the nine h2 sections above)
|
||||
-- ---------------------------------------------------------------------------
|
||||
|
|
@ -726,6 +735,7 @@ pageTOC = H.ol $ mapM_ item sections
|
|||
, ("links", "Links")
|
||||
, ("epistemic", "Epistemic coverage")
|
||||
, ("output", "Output")
|
||||
, ("archive", "Link archive")
|
||||
, ("repository", "Repository")
|
||||
, ("build", "Build")
|
||||
]
|
||||
|
|
@ -743,6 +753,16 @@ statsRules tags = do
|
|||
create ["build/index.html"] $ do
|
||||
route idRoute
|
||||
compile $ do
|
||||
-- ----------------------------------------------------------------
|
||||
-- Per-build stamp dependency: data/build-stamp.txt is rewritten
|
||||
-- by Main.main on every invocation, so loading it here forces
|
||||
-- Hakyll to recompile this page each build. Without it the page
|
||||
-- is served from cache whenever no tracked content changed, and
|
||||
-- every unsafeCompiler-sourced figure below (timestamp, output
|
||||
-- stats, git, LOC) goes stale. The value itself is unused.
|
||||
-- ----------------------------------------------------------------
|
||||
_ <- load (fromFilePath "data/build-stamp.txt") :: Compiler (Item String)
|
||||
|
||||
-- ----------------------------------------------------------------
|
||||
-- Load all content items
|
||||
-- ----------------------------------------------------------------
|
||||
|
|
@@ -846,6 +866,11 @@ statsRules tags = do
|
|||
(hf, hl, cf, cl, jf, jl) <- unsafeCompiler getLocStats
|
||||
(commits, firstDate) <- unsafeCompiler getGitStats
|
||||
|
||||
-- ----------------------------------------------------------------
|
||||
-- Link-archive coverage + link-rot health
|
||||
-- ----------------------------------------------------------------
|
||||
archiveMetrics <- unsafeCompiler archiveBuildStats
|
||||
|
||||
-- ----------------------------------------------------------------
|
||||
-- Build timestamp + last build duration
|
||||
-- ----------------------------------------------------------------
|
||||
|
|
@@ -869,6 +894,7 @@ statsRules tags = do
|
|||
renderLinks mostLinkedInfo orphanCount (length allPIs)
|
||||
renderEpistemic epTotal withStatus withConf withImp withEv
|
||||
renderOutput outputGrouped totalFiles totalSize
|
||||
renderArchive archiveMetrics
|
||||
renderRepository hf hl cf cl jf jl commits firstDate
|
||||
renderBuild buildTimestamp lastBuildDur
|
||||
contentString = renderHtml htmlContent
|
||||
|
|
@@ -897,6 +923,11 @@ statsRules tags = do
|
|||
create ["stats/index.html"] $ do
|
||||
route idRoute
|
||||
compile $ do
|
||||
-- Per-build stamp dependency — forces a recompile every build
|
||||
-- so the heatmap's "today" and all corpus figures stay current.
|
||||
-- See the /build/ rule above for the full rationale.
|
||||
_ <- load (fromFilePath "data/build-stamp.txt") :: Compiler (Item String)
|
||||
|
||||
essays <- loadAll (P.essayPattern .&&. hasNoVersion)
|
||||
posts <- loadAll ("content/blog/*.md" .&&. hasNoVersion)
|
||||
poems <- loadAll ("content/poetry/*.md" .&&. hasNoVersion)
|
||||
|
|
|
|||
|
|
@@ -13,6 +13,8 @@ executable site
|
|||
hs-source-dirs: build
|
||||
other-modules:
|
||||
Site
|
||||
Archive
|
||||
ArchiveIndex
|
||||
Authors
|
||||
Catalog
|
||||
Commonplace
|
||||
|
|
@@ -36,6 +38,7 @@ executable site
|
|||
Filters.Sidenotes
|
||||
Filters.Dropcaps
|
||||
Filters.Smallcaps
|
||||
Filters.Archive
|
||||
Filters.Wikilinks
|
||||
Filters.Transclusion
|
||||
Filters.EmbedPdf
|
||||
|
|
|
|||
|
|
@@ -0,0 +1,45 @@
|
|||
# archive.conf — `X-Robots-Tag: noindex, noarchive` for the link archive.
|
||||
#
|
||||
# Place at /etc/nginx/snippets/archive.conf and `include` it inside the
|
||||
# levineuwirth.org server { } block, *after* security-headers.conf:
|
||||
#
|
||||
# server {
|
||||
# server_name levineuwirth.org;
|
||||
# root /var/www/levineuwirth.org;
|
||||
# ...
|
||||
# include snippets/security-headers.conf;
|
||||
# include snippets/static-assets.conf;
|
||||
# include snippets/popup-proxy.conf;
|
||||
# include snippets/archive.conf;
|
||||
# }
|
||||
#
|
||||
# Why a header set in a location block rather than robots.txt: a URL blocked by
|
||||
# robots.txt can still appear in results when externally linked, and the
|
||||
# noindex directive must be reachable. Wrapper pages carry the meta in
|
||||
# HTML, and the HTML snapshots have the same meta injected at fetch
|
||||
# time. But raw PDFs cannot carry meta directives — and a robots.txt
|
||||
# Disallow on /archive/ would prevent crawlers from reading the wrapper
|
||||
# meta in the first place. The header form is the right control for the
|
||||
# whole tree: crawlers honour it for any resource, HTML or PDF.
|
||||
#
|
||||
# `^~` makes this prefix-match take priority over any regex location
|
||||
# that might match the same path.
|
||||
|
||||
location ^~ /archive/ {
|
||||
# nginx's add_header chain is inherited from a parent context ONLY
|
||||
# when the current context declares no add_header directives — see
|
||||
# nginx.org/en/docs/http/ngx_http_headers_module.html. Adding any
|
||||
# header inside this location would silently drop the baseline
|
||||
# security headers within the /archive/ subtree, so we re-include
|
||||
# security-headers.conf to keep HSTS, CSP, X-Frame-Options, etc.
|
||||
# intact for archive pages and raw artifacts.
|
||||
include snippets/security-headers.conf;
|
||||
|
||||
# `always` so the header is emitted even on 4xx/5xx responses (the
|
||||
# default add_header only sets on 2xx/3xx; without `always` a 404
|
||||
# under /archive/ could be indexed).
|
||||
add_header X-Robots-Tag "noindex, noarchive" always;
|
||||
|
||||
# Hand off to the same static-file fallback as the rest of the site.
|
||||
try_files $uri $uri/index.html $uri.html =404;
|
||||
}
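The add_header inheritance rule described in the comments above can be modelled in a few lines of Python. This is a simplified sketch, not nginx's implementation: `effective_headers` is a hypothetical helper, and the header values stand in for whatever security-headers.conf actually sets, which this diff does not show.

```python
def effective_headers(parent: dict[str, str], child: dict[str, str]) -> dict[str, str]:
    """Headers emitted for a context under the add_header inheritance rule.

    Simplified model: a context inherits its parent's add_header set only
    when it declares no add_header directives of its own.
    """
    return dict(child) if child else dict(parent)


# Illustrative baseline; the real values live in security-headers.conf.
server = {
    "Strict-Transport-Security": "max-age=63072000",
    "X-Frame-Options": "DENY",
}
robots = {"X-Robots-Tag": "noindex, noarchive"}

# Declaring only the robots header inside the location drops the baseline.
naive = effective_headers(server, robots)

# Re-including the baseline next to it, as archive.conf does, keeps both.
fixed = effective_headers(server, {**server, **robots})

print(sorted(naive))
print(sorted(fixed))
```

Under this model the naive location emits only X-Robots-Tag, which is the silent loss the snippet avoids by re-including security-headers.conf.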
|
||||
|
|
@@ -42,6 +42,12 @@ server {
|
|||
include snippets/security-headers.conf;
|
||||
include snippets/static-assets.conf;
|
||||
include snippets/popup-proxy.conf;
|
||||
# archive.conf must come *after* security-headers.conf — it declares
|
||||
# its own add_header inside `location ^~ /archive/`, which (per the
|
||||
# nginx add_header inheritance rules) would otherwise drop the
|
||||
# baseline headers within that subtree. The snippet re-includes
|
||||
# security-headers.conf inside its location to compensate.
|
||||
include snippets/archive.conf;
|
||||
|
||||
# Static-site fallback. Pretty URLs first (foo/index.html, foo.html),
|
||||
# then 404.
|
||||
|
|
|
|||
|
|
@@ -0,0 +1,463 @@
|
|||
/* archive.css — the link archive: /archive/ and /archive/<slug>/.
|
||||
*
|
||||
* Gated in head.html via $if(archive)$ (build/Archive.hs sets the flag on
|
||||
* the index and every entry page). The archive pages are structured
|
||||
* surfaces rather than prose, but they render inside #markdownBody — so
|
||||
* every rule here is scoped under #markdownBody to clear the id-specificity
|
||||
* prose rules in typography.css (heading scales, figure framing, paragraph
|
||||
* indent) that would otherwise win over a bare class.
|
||||
*
|
||||
* Treatment: "framed / structured" — the archival chrome (banner,
|
||||
* provenance panel, the embedded artifact viewer) is given visible borders
|
||||
* so a reader is never in doubt that this is a preservation copy, not the
|
||||
* original. All colour comes from tokens, so dark mode follows for free;
|
||||
* the embedded artifact itself is shown raw and is deliberately not themed.
|
||||
*/
|
||||
|
||||
/* Structured pages, not essays — no first-line indent on any paragraph. */
|
||||
#markdownBody :is(.archive-banner-text, .archive-degraded, .archive-note,
|
||||
.archive-private, .archive-status-note, .archive-index-intro,
|
||||
.archive-removal, .archive-empty),
|
||||
#markdownBody .archive-fulltext-wrap > p {
|
||||
text-indent: 0;
|
||||
}
|
||||
|
||||
/* ============================================================
|
||||
ENTRY HEADER + ARCHIVAL BANNER
|
||||
The banner is a bordered callout, stacked: a small-caps label,
|
||||
one plain-language line, and the original link given real
|
||||
weight — the original is the hero, never the archived copy.
|
||||
============================================================ */
|
||||
|
||||
#markdownBody .archive-header {
|
||||
margin-bottom: 0.5rem;
|
||||
}
|
||||
|
||||
#markdownBody .archive-header .page-title {
|
||||
margin-bottom: 0;
|
||||
}
|
||||
|
||||
#markdownBody .archive-banner {
|
||||
margin-top: 1.4rem;
|
||||
padding: 0.9rem 1.1rem;
|
||||
display: flex;
|
||||
flex-direction: column;
|
||||
gap: 0.3rem;
|
||||
border: 1px solid var(--border-muted);
|
||||
border-radius: 2px;
|
||||
background: var(--bg-subtle);
|
||||
}
|
||||
|
||||
#markdownBody .archive-banner-label {
|
||||
margin: 0;
|
||||
font-family: var(--font-sans);
|
||||
font-size: 0.7rem;
|
||||
font-variant: all-small-caps;
|
||||
font-feature-settings: "smcp" 1;
|
||||
letter-spacing: 0.13em;
|
||||
color: var(--text-muted);
|
||||
}
|
||||
|
||||
#markdownBody .archive-banner-text {
|
||||
margin: 0;
|
||||
font-family: var(--font-serif);
|
||||
font-size: 0.95rem;
|
||||
line-height: 1.5;
|
||||
color: var(--text);
|
||||
}
|
||||
|
||||
#markdownBody .archive-banner-original {
|
||||
align-self: flex-start;
|
||||
font-family: var(--font-sans);
|
||||
font-size: 0.85rem;
|
||||
font-weight: 600;
|
||||
}
|
||||
|
||||
/* Degraded / js-required snapshots: a dashed-border note. Restrained —
|
||||
the monochrome palette has no alarm colour and wants none. */
|
||||
#markdownBody .archive-degraded {
|
||||
margin: 1rem 0 0;
|
||||
padding: 0.7rem 1rem;
|
||||
border: 1px dashed var(--border-muted);
|
||||
border-radius: 2px;
|
||||
font-family: var(--font-serif);
|
||||
font-size: 0.9rem;
|
||||
line-height: 1.55;
|
||||
color: var(--text-muted);
|
||||
}
|
||||
|
||||
#markdownBody .archive-degraded-label {
|
||||
margin-right: 0.4rem;
|
||||
font-family: var(--font-sans);
|
||||
font-size: 0.7rem;
|
||||
font-variant: all-small-caps;
|
||||
font-feature-settings: "smcp" 1;
|
||||
letter-spacing: 0.1em;
|
||||
color: var(--text);
|
||||
}
|
||||
|
||||
/* Private entry: the artifact is held offline, not published — a calm
|
||||
informational panel in place of the artifact viewer. */
|
||||
#markdownBody .archive-private {
|
||||
margin: 1.8rem 0;
|
||||
padding: 1rem 1.2rem;
|
||||
border: 1px solid var(--border);
|
||||
border-radius: 2px;
|
||||
background: var(--bg-subtle);
|
||||
font-family: var(--font-serif);
|
||||
font-size: 0.95rem;
|
||||
line-height: 1.6;
|
||||
color: var(--text-muted);
|
||||
}
|
||||
|
||||
/* Link-rot status — a header note for non-live states (archive.py check),
|
||||
and the status word in the provenance panel. The palette is monochrome,
|
||||
so a `rotted` entry is marked by weight and a heavier left rule, never
|
||||
colour. */
|
||||
#markdownBody .archive-status-note {
|
||||
margin: 1rem 0 0;
|
||||
padding: 0.7rem 1rem;
|
||||
border: 1px solid var(--border-muted);
|
||||
border-left-width: 3px;
|
||||
border-radius: 2px;
|
||||
font-family: var(--font-serif);
|
||||
font-size: 0.92rem;
|
||||
line-height: 1.55;
|
||||
color: var(--text);
|
||||
}
|
||||
|
||||
#markdownBody .archive-status-note--rotted {
|
||||
border-left-color: var(--text);
|
||||
}
|
||||
|
||||
#markdownBody .archive-status-note--moved {
|
||||
color: var(--text-muted);
|
||||
}
|
||||
|
||||
#markdownBody .archive-status {
|
||||
font-variant: all-small-caps;
|
||||
font-feature-settings: "smcp" 1;
|
||||
letter-spacing: 0.04em;
|
||||
}
|
||||
|
||||
#markdownBody .archive-status--live {
|
||||
color: var(--text-muted);
|
||||
}
|
||||
|
||||
#markdownBody .archive-status--rotted {
|
||||
font-weight: 600;
|
||||
}
|
||||
|
||||
/* ============================================================
|
||||
PROVENANCE PANEL
|
||||
A bordered box with a small-caps label; the metadata is a
|
||||
two-column key/value grid — labels auto-sized, values take
|
||||
the rest, long URLs and hashes wrap rather than overflow.
|
||||
============================================================ */
|
||||
|
||||
#markdownBody .archive-provenance {
|
||||
margin: 1.8rem 0;
|
||||
padding: 1rem 1.2rem 1.1rem;
|
||||
border: 1px solid var(--border);
|
||||
border-radius: 2px;
|
||||
}
|
||||
|
||||
#markdownBody .archive-panel-title {
|
||||
margin: 0 0 0.7rem;
|
||||
font-family: var(--font-sans);
|
||||
font-size: 0.72rem;
|
||||
font-weight: 600;
|
||||
font-variant: all-small-caps;
|
||||
font-feature-settings: "smcp" 1;
|
||||
letter-spacing: 0.12em;
|
||||
color: var(--text-faint);
|
||||
}
|
||||
|
||||
#markdownBody .archive-meta {
|
||||
margin: 0;
|
||||
display: grid;
|
||||
grid-template-columns: max-content 1fr;
|
||||
gap: 0.34rem 1.1rem;
|
||||
}
|
||||
|
||||
#markdownBody .archive-meta dt {
|
||||
font-family: var(--font-sans);
|
||||
font-size: 0.78rem;
|
||||
font-variant: all-small-caps;
|
||||
font-feature-settings: "smcp" 1;
|
||||
letter-spacing: 0.05em;
|
||||
color: var(--text-faint);
|
||||
}
|
||||
|
||||
#markdownBody .archive-meta dd {
|
||||
margin: 0;
|
||||
font-family: var(--font-serif);
|
||||
font-size: 0.92rem;
|
||||
color: var(--text);
|
||||
overflow-wrap: anywhere;
|
||||
}
|
||||
|
||||
#markdownBody .archive-meta dd code {
|
||||
font-family: var(--font-mono);
|
||||
font-size: 0.82rem;
|
||||
}
|
||||
|
||||
/* The author's reason-for-archiving note, set in the page measure. */
|
||||
#markdownBody .archive-note {
|
||||
margin: 1.6rem 0;
|
||||
font-family: var(--font-serif);
|
||||
font-size: 0.97rem;
|
||||
font-style: italic;
|
||||
line-height: 1.6;
|
||||
color: var(--text-muted);
|
||||
}
|
||||
|
||||
/* ============================================================
|
||||
ARTIFACT VIEWER
|
||||
A <div> (not a <figure> — that carries prose framing) with a
|
||||
mono caption bar that names the raw artifact and links to it,
|
||||
and the artifact embedded raw beneath: the PDF renders in the
|
||||
browser's native viewer, the HTML snapshot loads sandboxed.
|
||||
============================================================ */
|
||||
|
||||
#markdownBody .archive-viewer {
|
||||
margin: 1.8rem 0;
|
||||
border: 1px solid var(--border-muted);
|
||||
border-radius: 2px;
|
||||
overflow: hidden;
|
||||
box-shadow: 0 4px 12px rgba(0, 0, 0, 0.03);
|
||||
}
|
||||
|
||||
#markdownBody .archive-viewer-bar {
|
||||
display: flex;
|
||||
align-items: baseline;
|
||||
justify-content: space-between;
|
||||
gap: 1rem;
|
||||
padding: 0.45rem 0.75rem;
|
||||
border-bottom: 1px solid var(--border-muted);
|
||||
background: var(--bg-subtle);
|
||||
}
|
||||
|
||||
#markdownBody .archive-viewer-name {
|
||||
font-family: var(--font-mono);
|
||||
font-size: 0.78rem;
|
||||
color: var(--text-muted);
|
||||
}
|
||||
|
||||
#markdownBody .archive-viewer-open {
|
||||
font-family: var(--font-sans);
|
||||
font-size: 0.76rem;
|
||||
white-space: nowrap;
|
||||
}
|
||||
|
||||
#markdownBody .archive-frame {
|
||||
display: block;
|
||||
width: 100%;
|
||||
height: 80vh;
|
||||
border: 0;
|
||||
background: var(--bg);
|
||||
}
|
||||
|
||||
/* ============================================================
|
||||
EXTRACTED FULL TEXT
|
||||
Always in the DOM, for embed.py / Pagefind. PDF text is
|
||||
collapsed in a <details> and keeps its pdftotext layout in a
|
||||
scrollable mono block; HTML text shows as serif paragraphs.
|
||||
============================================================ */
|
||||
|
||||
#markdownBody .archive-fulltext-wrap {
|
||||
margin: 1.8rem 0 0;
|
||||
}
|
||||
|
||||
#markdownBody .archive-fulltext-title,
|
||||
#markdownBody .archive-section-title {
|
||||
margin: 0 0 0.6rem;
|
||||
padding-bottom: 0.4rem;
|
||||
border-bottom: 1px solid var(--border);
|
||||
font-family: var(--font-sans);
|
||||
font-size: 0.78rem;
|
||||
font-weight: 600;
|
||||
font-variant: all-small-caps;
|
||||
font-feature-settings: "smcp" 1;
|
||||
letter-spacing: 0.1em;
|
||||
color: var(--text-muted);
|
||||
}
|
||||
|
||||
#markdownBody summary.archive-fulltext-title {
|
||||
cursor: pointer;
|
||||
}
|
||||
|
||||
#markdownBody .archive-fulltext-wrap > p {
|
||||
margin: 0 0 0.85rem;
|
||||
font-family: var(--font-serif);
|
||||
font-size: 0.95rem;
|
||||
line-height: 1.6;
|
||||
color: var(--text);
|
||||
}
|
||||
|
||||
/* The pdftotext block: scroll-capped so it never dominates the page. */
|
||||
#markdownBody .archive-fulltext {
|
||||
margin: 0.8rem 0 0;
|
||||
padding: 0.9rem 1rem;
|
||||
max-height: 60vh;
|
||||
overflow: auto;
|
||||
border: 1px solid var(--border);
|
||||
border-radius: 2px;
|
||||
background: var(--bg-subtle);
|
||||
font-family: var(--font-mono);
|
||||
font-size: 0.8rem;
|
||||
line-height: 1.5;
|
||||
color: var(--text-muted);
|
||||
white-space: pre-wrap;
|
||||
overflow-wrap: anywhere;
|
||||
}
|
||||
|
||||
/* ============================================================
|
||||
REFERENCED BY / RELATED
|
||||
The site-wide .backlinks-list / .similar-links-list styles
|
||||
(components.css) carry the lists themselves; these rules add
|
||||
only the section framing and the granular fragment groups.
|
||||
============================================================ */
|
||||
|
||||
#markdownBody .archive-backlinks,
|
||||
#markdownBody .archive-related {
|
||||
margin: 1.8rem 0 0;
|
||||
}
|
||||
|
||||
#markdownBody .referenced-by-group {
|
||||
margin-top: 0.9rem;
|
||||
}
|
||||
|
||||
#markdownBody .referenced-by-fragment {
|
||||
margin: 0 0 0.3rem;
|
||||
font-family: var(--font-sans);
|
||||
font-size: 0.72rem;
|
||||
font-weight: 600;
|
||||
font-variant: all-small-caps;
|
||||
font-feature-settings: "smcp" 1;
|
||||
letter-spacing: 0.08em;
|
||||
color: var(--text-faint);
|
||||
}
|
||||
|
||||
/* ============================================================
|
||||
REMOVAL NOTICE
|
||||
A quiet italic footer line, set off by a top rule — present
|
||||
on every archive page and on the index.
|
||||
============================================================ */
|
||||
|
||||
#markdownBody .archive-removal {
|
||||
margin: 2.4rem 0 0;
|
||||
padding-top: 1rem;
|
||||
border-top: 1px solid var(--border);
|
||||
font-family: var(--font-serif);
|
||||
font-size: 0.85rem;
|
||||
font-style: italic;
|
||||
line-height: 1.55;
|
||||
color: var(--text-faint);
|
||||
}
|
||||
|
||||
/* ============================================================
|
||||
INDEX PAGE — /archive/
|
||||
A text list in the catalog idiom: one hairline between rows,
|
||||
the title in serif, type + date + any quality flag in quiet
|
||||
sans pushed to the row's end.
|
||||
============================================================ */
|
||||
|
||||
#markdownBody .archive-index-header {
|
||||
margin-bottom: 1.8rem;
|
||||
}
|
||||
|
||||
#markdownBody .archive-index-intro {
|
||||
margin: 0.6rem 0 0;
|
||||
font-family: var(--font-serif);
|
||||
font-size: 1rem;
|
||||
line-height: 1.6;
|
||||
color: var(--text-muted);
|
||||
}
|
||||
|
||||
#markdownBody .archive-list {
|
||||
margin: 0;
|
||||
padding: 0;
|
||||
list-style: none;
|
||||
}
|
||||
|
||||
#markdownBody .archive-list-item {
|
||||
display: flex;
|
||||
align-items: baseline;
|
||||
justify-content: space-between;
|
||||
gap: 0.4rem 1rem;
|
||||
flex-wrap: wrap;
|
||||
padding: 0.7rem 0;
|
||||
border-bottom: 1px solid var(--border);
|
||||
}
|
||||
|
||||
#markdownBody .archive-list-item:last-child {
|
||||
border-bottom: none;
|
||||
}
|
||||
|
||||
#markdownBody .archive-list-link {
|
||||
font-family: var(--font-serif);
|
||||
font-size: 1.05rem;
|
||||
color: var(--text);
|
||||
text-decoration: none;
|
||||
}
|
||||
|
||||
#markdownBody .archive-list-link:hover {
|
||||
text-decoration: underline;
|
||||
text-underline-offset: 2px;
|
||||
}
|
||||
|
||||
#markdownBody .archive-list-meta {
|
||||
font-family: var(--font-sans);
|
||||
font-size: 0.78rem;
|
||||
color: var(--text-faint);
|
||||
white-space: nowrap;
|
||||
}
|
||||
|
||||
/* Non-'ok' capture flag — a dashed chip, echoing the entry-page note. */
|
||||
#markdownBody .archive-quality-flag {
|
||||
padding: 0.05em 0.4em;
|
||||
border: 1px dashed var(--border-muted);
|
||||
border-radius: 2px;
|
||||
font-variant: all-small-caps;
|
||||
font-feature-settings: "smcp" 1;
|
||||
letter-spacing: 0.04em;
|
||||
color: var(--text-muted);
|
||||
}
|
||||
|
||||
/* A rotted entry is the one health state worth a solid, inked flag. */
|
||||
#markdownBody .archive-quality-flag--rotted {
|
||||
border-style: solid;
|
||||
border-color: var(--text);
|
||||
color: var(--text);
|
||||
}
|
||||
|
||||
#markdownBody .archive-empty {
|
||||
font-family: var(--font-serif);
|
||||
font-style: italic;
|
||||
color: var(--text-muted);
|
||||
}
|
||||
|
||||
/* ============================================================
|
||||
MOBILE
|
||||
Collapse the provenance grid to stacked rows; trim the frame.
|
||||
============================================================ */
|
||||
|
||||
@media (max-width: 540px) {
|
||||
#markdownBody .archive-meta {
|
||||
grid-template-columns: 1fr;
|
||||
gap: 0;
|
||||
}
|
||||
|
||||
#markdownBody .archive-meta dt {
|
||||
margin-top: 0.55rem;
|
||||
}
|
||||
|
||||
#markdownBody .archive-meta dt:first-of-type {
|
||||
margin-top: 0;
|
||||
}
|
||||
|
||||
#markdownBody .archive-frame {
|
||||
height: 70vh;
|
||||
}
|
||||
}
|
||||
|
|
@@ -1849,3 +1849,50 @@ pre:hover .copy-btn,
|
|||
min-height: 300px;
|
||||
}
|
||||
}
|
||||
|
||||
/* ── Archive affordance ─────────────────────────────────────────────────────
|
||||
The superscript "A" appended after a body link whose target is preserved
|
||||
in the local archive (build/Filters/Archive.hs). Loaded site-wide because
|
||||
the marker appears in essay/prose content, not on archive pages. */
|
||||
|
||||
.archive-affordance {
|
||||
font-size: 0.7em;
|
||||
margin-left: 0.15em;
|
||||
line-height: 0;
|
||||
}
|
||||
|
||||
.archive-affordance a {
|
||||
font-family: var(--font-sans);
|
||||
font-weight: 600;
|
||||
text-decoration: none;
|
||||
color: var(--text-faint);
|
||||
border: 1px solid var(--border-muted);
|
||||
border-radius: 2px;
|
||||
padding: 0 0.25em;
|
||||
}
|
||||
|
||||
.archive-affordance a:hover {
|
||||
color: var(--text);
|
||||
border-color: var(--text-muted);
|
||||
background: var(--bg-subtle);
|
||||
}
|
||||
|
||||
/* Dead-link flip — a body link whose archived target is `rotted` has its
|
||||
href redirected to the local copy (build/Filters/Archive.hs). A dotted
|
||||
underline marks the link as redirected; its marker becomes a solid chip
|
||||
reading "archived" rather than the quiet bordered "A". */
|
||||
.archive-rotted {
|
||||
text-decoration-style: dotted;
|
||||
}
|
||||
|
||||
.archive-affordance--rotted a {
|
||||
color: var(--bg);
|
||||
background: var(--text-muted);
|
||||
border-color: var(--text-muted);
|
||||
}
|
||||
|
||||
.archive-affordance--rotted a:hover {
|
||||
color: var(--bg);
|
||||
background: var(--text);
|
||||
border-color: var(--text);
|
||||
}
|
||||
|
|
|
|||
|
|
@@ -0,0 +1,23 @@
|
|||
<div id="content">
|
||||
<main id="markdownBody" data-pagefind-body>
|
||||
<header class="archive-index-header">
|
||||
<h1 class="page-title">$title$</h1>
|
||||
<p class="archive-index-intro">Local snapshots of works referenced across the site, preserved against link rot. Each is an archived copy; the original is linked prominently from its page.</p>
|
||||
</header>
|
||||
|
||||
$if(has-entries)$
|
||||
<ul class="archive-list">
|
||||
$for(entries)$
|
||||
<li class="archive-list-item">
|
||||
<a class="archive-list-link" href="$entry-url$">$entry-title$</a>
|
||||
<span class="archive-list-meta">$entry-type$ · archived $entry-archived$$if(entry-degraded)$ · <span class="archive-quality-flag">$entry-quality$ capture</span>$endif$$if(entry-private)$ · <span class="archive-quality-flag">private</span>$endif$$if(entry-rotted)$ · <span class="archive-quality-flag archive-quality-flag--rotted">link rotted</span>$endif$</span>
|
||||
</li>
|
||||
$endfor$
|
||||
</ul>
|
||||
$else$
|
||||
<p class="archive-empty">Nothing archived yet.</p>
|
||||
$endif$
|
||||
|
||||
$partial("templates/partials/archive-removal-notice.html")$
|
||||
</main>
|
||||
</div>
|
||||
|
|
@@ -0,0 +1,109 @@
|
|||
<div id="content">
|
||||
<main id="markdownBody" data-pagefind-body data-pagefind-filter="type:archive, status:$status$">
|
||||
<article class="archive-entry">
|
||||
<header class="archive-header">
|
||||
<h1 class="page-title">$title$</h1>
|
||||
$partial("templates/partials/archive-banner.html")$
|
||||
$if(status-note)$
|
||||
<p class="archive-status-note archive-status-note--$status$" role="note">
|
||||
$status-note$
|
||||
</p>
|
||||
$endif$
|
||||
$if(degraded)$
|
||||
<p class="archive-degraded" role="note">
|
||||
<span class="archive-degraded-label">Capture: $snapshot-quality$</span>
|
||||
Some of the original's content (images, scripted elements)
|
||||
may be missing or incomplete in this snapshot. The original
|
||||
is linked above.
|
||||
</p>
|
||||
$endif$
|
||||
</header>
|
||||
|
||||
<section class="archive-provenance" aria-label="Provenance">
|
||||
<h2 class="archive-panel-title">Provenance</h2>
|
||||
<dl class="archive-meta">
|
||||
<dt>Original</dt>
|
||||
<dd><a href="$original-url$" rel="noopener noreferrer" target="_blank">$original-url$</a></dd>
|
||||
<dt>Link status</dt>
|
||||
<dd class="archive-status archive-status--$status$">$status$</dd>
|
||||
<dt>Archived</dt>
|
||||
<dd>$archived$</dd>
|
||||
<dt>Type</dt>
|
||||
<dd>$archive-type$</dd>
|
||||
<dt>Snapshot quality</dt>
|
||||
<dd>$snapshot-quality$</dd>
|
||||
<dt>Size</dt>
|
||||
<dd>$size$</dd>
|
||||
<dt>SHA-256</dt>
|
||||
<dd><code>$sha-short$…</code></dd>
|
||||
$if(wayback)$
|
||||
<dt>Wayback</dt>
|
||||
<dd><a href="$wayback$" rel="noopener noreferrer" target="_blank">web.archive.org copy</a></dd>
|
||||
$endif$
|
||||
$if(paywalled)$
|
||||
<dt>Access</dt>
|
||||
<dd>The original sits behind a paywall.</dd>
|
||||
$endif$
|
||||
$if(private)$
|
||||
<dt>Visibility</dt>
|
||||
<dd>private — held offline</dd>
|
||||
$endif$
|
||||
</dl>
|
||||
</section>
|
||||
|
||||
$if(note)$<p class="archive-note">$note$</p>$endif$
|
||||
|
||||
$if(private)$
|
||||
<p class="archive-private" role="note">
|
||||
This work is archived <strong>privately</strong>: a local
|
||||
preservation copy is kept against link rot, but the artifact
|
||||
is not published here. Use the original link above to read it.
|
||||
</p>
|
||||
$else$
|
||||
<div class="archive-viewer">
|
||||
<div class="archive-viewer-bar">
|
||||
<span class="archive-viewer-name">$artifact-name$</span>
|
||||
<a class="archive-viewer-open" href="$artifact-url$" target="_blank" rel="noopener noreferrer">Open raw ↗</a>
|
||||
</div>
|
||||
$if(is-pdf)$
|
||||
<iframe class="archive-frame" src="$artifact-url$" title="$title$ — archived document" loading="lazy"></iframe>
|
||||
$endif$
|
||||
$if(is-html)$
|
||||
<iframe class="archive-frame" src="$artifact-url$" title="$title$ — archived snapshot" sandbox referrerpolicy="no-referrer" loading="lazy"></iframe>
|
||||
$endif$
|
||||
</div>
|
||||
$endif$
|
||||
|
||||
$if(fulltext)$
|
||||
$if(is-pdf)$
|
||||
<details class="archive-fulltext-wrap">
|
||||
<summary class="archive-fulltext-title">Full text (extracted)</summary>
|
||||
$fulltext$
|
||||
</details>
|
||||
$endif$
|
||||
$if(is-html)$
|
||||
<section class="archive-fulltext-wrap">
|
||||
<h2 class="archive-fulltext-title">Readable text (extracted)</h2>
|
||||
$fulltext$
|
||||
</section>
|
||||
$endif$
|
||||
$endif$
|
||||
|
||||
$if(referenced-by)$
|
||||
<section class="archive-backlinks">
|
||||
<h2 class="archive-section-title">Referenced by</h2>
|
||||
$referenced-by$
|
||||
</section>
|
||||
$endif$
|
||||
|
||||
$if(similar-links)$
|
||||
<section class="archive-related">
|
||||
<h2 class="archive-section-title">Related</h2>
|
||||
$similar-links$
|
||||
</section>
|
||||
$endif$
|
||||
|
||||
$partial("templates/partials/archive-removal-notice.html")$
|
||||
</article>
|
||||
</main>
|
||||
</div>
|
||||
|
|
@@ -0,0 +1,5 @@
|
|||
<div class="archive-banner" role="note">
|
||||
<p class="archive-banner-label">Archived copy</p>
|
||||
<p class="archive-banner-text">A local preservation snapshot taken $archived$ — this page is not the original.</p>
|
||||
<a class="archive-banner-original" href="$original-url$" rel="noopener noreferrer" target="_blank">View the original ↗</a>
|
||||
</div>
|
||||
|
|
@@ -0,0 +1,5 @@
|
|||
<p class="archive-removal">
|
||||
This is an archived copy, preserved so that a work cited across the site
|
||||
survives the original going dark. To request removal, email
|
||||
<a href="mailto:ln@levineuwirth.org">ln@levineuwirth.org</a>.
|
||||
</p>
|
||||
|
|
@@ -2,6 +2,7 @@
|
|||
<meta name="viewport" content="width=device-width, initial-scale=1">
|
||||
$if(home)$<title>Levi Neuwirth</title>$else$$if(title)$<title>$title$ — Levi Neuwirth</title>$else$<title>Levi Neuwirth</title>$endif$$endif$
|
||||
$if(description)$<meta name="description" content="$description$">$endif$
|
||||
$if(noindex)$<meta name="robots" content="noindex">$endif$
|
||||
<link rel="canonical" href="$site-url$$url$">
|
||||
<link rel="alternate" type="application/atom+xml" title="Levi Neuwirth" href="/feed.xml">
|
||||
<link rel="alternate" type="application/atom+xml" title="Levi Neuwirth — music" href="/music/feed.xml">
|
||||
|
|
@@ -49,6 +50,7 @@ $if(build)$<link rel="stylesheet" href="/css/build.css">$endif$
|
|||
$if(reading)$<link rel="stylesheet" href="/css/reading.css">$endif$
|
||||
$if(composition)$<link rel="stylesheet" href="/css/score-reader.css">$endif$
|
||||
$if(photography)$<link rel="stylesheet" href="/css/photography.css">$endif$
|
||||
$if(archive)$<link rel="stylesheet" href="/css/archive.css">$endif$
|
||||
$if(photography-map)$<link rel="stylesheet" href="/leaflet/leaflet.css">$endif$
|
||||
$if(photography-map)$<link rel="stylesheet" href="/leaflet/MarkerCluster.css">$endif$
|
||||
$if(photography-map)$<link rel="stylesheet" href="/leaflet/MarkerCluster.Default.css">$endif$
|
||||
|
|
|
|||
File diff suppressed because it is too large
Binary file not shown.
|
|
@@ -48,7 +48,16 @@ MIN_SCORE = 0.30 # similar-links: discard weak matches
|
|||
MIN_PARA_CHARS = 80 # semantic: skip very short paragraphs
|
||||
MAX_PARA_CHARS = 1000 # semantic: truncate before embedding
|
||||
|
||||
-EXCLUDE_URLS = {"/search/", "/build/", "/404.html", "/feed.xml", "/music/feed.xml"}
+# /archive/ is the archive index — a list page that would dominate every
+# entry's "Related" set; the individual /archive/<slug>/ pages stay in.
+EXCLUDE_URLS = {"/search/", "/build/", "/404.html", "/feed.xml",
+                "/music/feed.xml", "/archive/"}
+
+# Whole subtrees kept out of the corpus. /source/ is the repository code
+# mirror — source files, not content; left in, they pollute every page's
+# "Related" set and semantic search (e.g. a template file surfacing as a
+# neighbour, titled with its unrendered "$title$" placeholder).
+EXCLUDE_PREFIXES = ("/source/",)
|
||||
|
||||
# Pages whose <body data-portal> are portal/landing pages — they aggregate
|
||||
# excerpts from many entries and would otherwise dominate every page's
|
||||
|
|
@@ -122,7 +131,7 @@ def extract_page(html_path: Path) -> dict | None:
|
|||
soup = BeautifulSoup(raw, "html.parser")
|
||||
url = _url_from_path(html_path)
|
||||
|
||||
-if url in EXCLUDE_URLS:
+if url in EXCLUDE_URLS or url.startswith(EXCLUDE_PREFIXES):
|
||||
return None
|
||||
body_tag = soup.body
|
||||
if body_tag is not None and body_tag.has_attr(PORTAL_BODY_ATTR):
|
||||
|
|
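The effect of the two exclusion rules in the hunks above can be checked in isolation. This is a minimal sketch that copies the two constants from the diff; `is_excluded` is a hypothetical helper introduced for illustration, and the sample URLs other than "/archive/" are made up.

```python
EXCLUDE_URLS = {"/search/", "/build/", "/404.html", "/feed.xml",
                "/music/feed.xml", "/archive/"}
EXCLUDE_PREFIXES = ("/source/",)


def is_excluded(url: str) -> bool:
    """True when a page should be kept out of the embedding corpus."""
    # str.startswith accepts a tuple and matches if any prefix matches.
    return url in EXCLUDE_URLS or url.startswith(EXCLUDE_PREFIXES)


# Exact match drops the index page only; entry pages stay in the corpus.
print(is_excluded("/archive/"))             # index page
print(is_excluded("/archive/some-slug/"))   # hypothetical entry page
print(is_excluded("/source/build/Site.hs")) # hypothetical mirrored source file
```

Keeping "/archive/" in the exact-match set rather than in the prefix tuple is what lets the individual /archive/<slug>/ pages remain candidates for "Related".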
|
|||
|
|
@@ -0,0 +1,17 @@
|
|||
# Pinned monolith binary — the HTML-snapshot tool for the link archive.
|
||||
#
|
||||
# Unlike PDF.js / Leaflet (servable assets downloaded at build time and
|
||||
# gitignored), monolith is a build-time *executable*: the binary itself is
|
||||
# committed at tools/bin/monolith so `git clone` -> `make build` needs no
|
||||
# network fetch and stays reproducible from a bare clone. See ARCHIVE.md.
|
||||
#
|
||||
# To re-vendor (version bump, or a build host on a different architecture):
|
||||
# 1. Download the matching asset from
|
||||
# https://github.com/Y2Z/monolith/releases
|
||||
# 2. Place it at tools/bin/monolith and `chmod +x`.
|
||||
# 3. Update the three values below; verify `tools/bin/monolith --version`.
|
||||
# 4. Commit the binary and this file together.
|
||||
|
||||
version = 2.10.1
|
||||
asset = monolith-gnu-linux-x86_64
|
||||
sha256 = 663ca914b078e91d5a854b4a07e913c613bbbcfe8fb11a24da1a6ab23c9205df
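Step 3 of the re-vendoring notes can be backed by a hash check against this pin file. The sketch below assumes the `key = value` layout shown above; the function names are hypothetical, the pin file's own path is not shown in this diff, and nothing here is claimed to be how tools/archive.py verifies the binary.

```python
import hashlib
from pathlib import Path


def read_pin(pin_path: Path) -> dict[str, str]:
    """Parse `key = value` lines, skipping blanks and # comments."""
    pin: dict[str, str] = {}
    for line in pin_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            pin[key.strip()] = value.strip()
    return pin


def sha256_of(path: Path) -> str:
    """Hex SHA-256 of a file, read in chunks so large binaries are fine."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def binary_matches_pin(binary: Path, pin_path: Path) -> bool:
    """True when the binary's hash equals the pinned sha256 value."""
    return sha256_of(binary) == read_pin(pin_path).get("sha256", "").lower()
```

A call such as `binary_matches_pin(Path("tools/bin/monolith"), pin_file)` would return False for a swapped or corrupted binary, and also when the pin file has no sha256 line.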
|
||||