levineuwirth.org — Comprehensive Audit

Auditor: Independent code review (read-only, no changes made)
Date: 2026-04-09
Scope: ~15,400 lines across Haskell build system (build/**/*.hs), Pandoc filters (build/Filters/*.hs), static JavaScript (static/js/*.js), CSS (static/css/*.css), templates (templates/**), Python tooling (tools/*.py), shell scripts (tools/*.sh), Makefile, cabal/pyproject configuration, and repository hygiene.
Methodology: Direct reading of critical modules (Site.hs, Contexts.hs, Stats.hs, Backlinks.hs, Compilers.hs, Citations.hs, Stability.hs, Catalog.hs, Commonplace.hs, Filters/*.hs, Makefile, shell scripts, embed.py); parallel exploration of JS, CSS, templates, and the larger Python tools.

Each finding is labeled by severity (CRITICAL, HIGH, MEDIUM, LOW, NIT) and cites file + line. The codebase is generally well-written — architecture is clean, modules are tightly scoped, YAML/frontmatter is parsed defensively, and escaping is applied in most HTML rendering sites. Most findings are local issues; the codebase does not exhibit systemic rot.


Executive summary

Confirmed correctness bugs (by impact):

| # | File | Severity | Summary |
|---|------|----------|---------|
| 1 | build/Filters/Images.hs:110 | CRITICAL | lowerExt is logically wrong: it returns "image." for "image.jpg". Every local raster fails isLocalRaster, so no <picture> / WebP wrapping happens site-wide. The entire WebP pipeline is dead code. |
| 2 | build/Commonplace.hs:126-131 | HIGH | Operator-precedence bug in renderChronoView: a ++ if c then x else y ++ z parses as a ++ (if c then x else (y ++ z)), so </div> is never emitted when the commonplace book is empty → unclosed tag. |
| 3 | tools/embed.py:68-73 | HIGH | Root index.html yields URL "/./" instead of "/". Homepage is never matched by SimilarLinks.hs, so the "Related" block never renders on the home page. |
| 4 | build/Authors.hs:50 | HIGH | allContent pattern does not include content/essays/*/index.md (directory-form essays). Author pages silently omit those essays. Compare against Tags.hs:69, which does include them. |
| 5 | build/Filters/Score.hs:40 | HIGH | TIO.readFile fullPath is called with no existence check and no exception catch. A missing SVG aborts the entire build with a bare openFile: does not exist, with no file name context and no graceful fallback. |
| 6 | build/Filters/Viz.hs:96-99 | HIGH | Same pattern: readProcessWithExitCode "python3" [fullPath] runs even when fullPath doesn't exist; the only signal the author gets is a generic "non-zero exit". |
| 7 | build/Filters/Sidenotes.hs:38 | HIGH | Sidenote labels wrap after the 26th note: (n - 1) `mod` 26 turns note 27 into a again, creating duplicate id="sn-a" / id="snref-a" across the same document. Breaks in-page links and screen-readers. |
| 8 | build/Filters/Images.hs:77 | MEDIUM | passedKvs filters out only loading and data-lightbox, but not id, class, alt, or title, all of which are already emitted explicitly above. Any author-set id= or class= kv on an image is emitted twice in the <img>, producing invalid HTML (<img … id="x" id="x">). |
| 9 | build/Contexts.hs:263-264 | MEDIUM | confidenceTrendField uses xs !! (length xs - 2) (O(n) indexing) and last xs. They are guarded by a length check so they're safe, but this is a partial idiom in a module that otherwise uses total patterns. |
| 10 | build/Filters/Links.hs:59 | MEDIUM | not ("levineuwirth.org" `T.isInfixOf` url) is a substring match. https://evil-levineuwirth.org.attacker.com is classified as internal, skipping rel=noopener noreferrer target=_blank. |

Defense-in-depth findings:

  • build/Filters/Transclusion.hs:41 interpolates the author-controlled sec section name into a data-section="..." attribute with no escaping. In a static site where all Markdown is author-authored this is not an exploitable XSS, but it is a raw-HTML injection primitive — a stray " in a section name will break markup, and any future lowering of the "author is trusted" assumption (PRs, multi-author site, user submissions) turns it into one.
  • build/Stats.hs:161-169 implements a correct URL allowlist (isSafeUrl) but accepts "/" as a prefix, which also matches //evil.com (protocol-relative URLs). Mostly cosmetic here since inputs come from Hakyll-computed routes, but the allowlist comment claims strict defense and this is a hole.
  • Two different authorSlugify / nameOf implementations exist (Authors.hs:30-39 and Contexts.hs:147-154). They'll drift the moment one is edited.
  • Five copies of escHtml exist in Haskell: Utils.hs:18-26 (the "real" one), Filters/Images.hs:135-142, Filters/Score.hs:88-92, Filters/Smallcaps.hs (per the filter audit), and Filters/Viz.hs:178-182. Equivalent functions also exist in JS (annotations.js, popups.js, semantic-search.js). Any fix must be made in 7+ places.

Repository hygiene:

  • .env is gitignored and not tracked — good.
  • ~5.4 MB of .docx binaries (BeyondComorbidityIndices*.docx) sit in the repo root, untracked but present; they're build input for the new essay but should be moved under paper/ or similar rather than the project root.
  • HOMEPAGE.md~ (zero-byte editor backup) is on disk; gitignore catches it, but it should be removed.
  • content/modern_idolatry.md is untracked and not under content/drafts/ — either it's a ready-to-publish draft that escaped the drafts workflow, or a forgotten scratch file.
  • build/Metadata.hs contains only module Metadata where — a no-op placeholder dragged along since Phase 2. Delete or populate.
  • build/Filters/Math.hs and build/Filters/Dropcaps.hs are apply = id placeholders; fine as TODO anchors, but -Wno-unused-imports in levineuwirth.cabal is masking warnings that would otherwise tell you so.

1. Haskell build system (build/*.hs)

1.1 Site.hs

L-1.1.1 — LOW — Blog posts do not support directory-form pages. content/blog/*.md (line 249) only matches flat posts; compare to essays and poetry, which accept both flat and */index.md. If the author ever wants to co-locate blog assets, they'll have to edit both the rule and Backlinks.hs:allContent.

L-1.1.2 — LOW — Backlinks pattern drift. allContent in Backlinks.hs:200-208, Authors.hs:50, Tags.hs:69, and the implicit patterns in Site.hs all enumerate the same content types, slightly differently. Authors omits directory essays; Backlinks omits fiction/*/index.md; Tags includes both essay forms but not fiction. This divergence is the root of finding #4 (Authors missing directory essays) and will continue to produce silent bugs. Extract one canonical Patterns.hs.

L-1.1.3 — LOW — draftEssays / isDev ties build correctness to an environment variable read at rule registration. isDev <- preprocess $ ... lookupEnv "SITE_ENV" runs once at startup. That is correct, but a developer toggling SITE_ENV in the middle of cabal run site -- watch will be confused when the change has no effect. Worth a comment at the preprocess call, not just near draftEssays.

L-1.1.4 — LOW — library.html loads all content four times. portalList calls loadAll essays, loadAll posts, loadAll fiction, loadAll poetry inside the inner list body, which is re-evaluated for each of the eight portalList calls. That's 32 loadAll calls for eight portals. Hakyll caches identifiers so the impact is bounded, but it's still unnecessary work; hoist the loads into the outer compile block.

NIT — random-pages.json (line 445). The type annotation :: Compiler [Item String] on every binding is load-bearing because without it Hakyll can't infer the snapshot type. Fine, but a quick comment would save a future reader from thinking they're decorative.

1.2 Contexts.hs

M-1.2.1 — MEDIUM — authorLinksField produces empty-slug URLs for empty author names. authorLinksField (line 161) splits on |, trims, and calls authorSlugify. An entry like "| https://url" or " " produces name "" → slug "" → URL /authors//. Guard against empty names (fall back to defaultAuthor or skip the entry).

M-1.2.2 — MEDIUM — parseMovements silently drops malformed entries. parseMovements (line 380-397) uses catMaybes $ map parseOne — an entry missing name or page is dropped with zero diagnostic. Compositions with a typo in one movement silently lose it. Add at least a putStrLn warning via unsafeCompiler or fail loudly.

L-1.2.3 — LOW — abstractField only strips single-Para abstracts. Line 184-186: Pandoc m [Para ils] -> Pandoc m [Plain ils]. An abstract with inline <br> or line breaks becomes multiple Para blocks and the outer <p> is not stripped. Harmless but inconsistent.

L-1.2.4 — LOW — confidenceTrendField threshold of ±5 is undocumented. Line 267-269: c - p > 5 → up, p - c > 5 → down. The comment in the header describes behavior but not the threshold. Magic number.

L-1.2.5 — LOW — pageScriptsField uses the script path as the item identifier. Line 123: Item (fromFilePath s) s. If two separate frontmatter entries both load shared.js, they collide in Hakyll's item-store the first time listField evaluates them. Probably works by accident because the inner script-src field just returns itemBody; note the risk.

NIT — getInt via Rational → Double → floor (line 396). If a page number is 1000000000000000000 (unlikely), Double precision loss. Use Scientific.floatingOrInteger from scientific (already transitively available via Aeson).

1.3 Stats.hs

M-1.3.1 — MEDIUM — stripHtmlTags is naive. Line 108-111 strips <...> greedily, ignoring > inside attribute values, <!-- ... --> comments, and <![CDATA[...]]>. Used to compute word count and reading time for the /build/ page so the impact is limited, but if a future author writes alt="a > b" (rare but legal) it'll slice the content.
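
One way to tolerate > inside quoted attribute values is a small three-state scanner. The sketch below is illustrative only: stripTagsQuoted is a hypothetical name, not the function in Stats.hs, and it deliberately still ignores comments and CDATA sections.

```haskell
-- Hypothetical quote-aware tag stripper (a sketch, not the Stats.hs code).
-- States: outside a tag, inside a tag, inside a quoted attribute value.
stripTagsQuoted :: String -> String
stripTagsQuoted = outside
  where
    -- Outside any tag: copy characters until a '<' opens a tag.
    outside ('<' : rest) = inTag rest
    outside (c : rest)   = c : outside rest
    outside []           = []

    -- Inside a tag: drop characters; a quote switches to inQuote.
    inTag ('>' : rest)   = outside rest
    inTag ('"' : rest)   = inQuote '"' rest
    inTag ('\'' : rest)  = inQuote '\'' rest
    inTag (_ : rest)     = inTag rest
    inTag []             = []

    -- Inside a quoted value: a '>' here does not close the tag.
    inQuote q (c : rest)
      | c == q    = inTag rest
      | otherwise = inQuote q rest
    inQuote _ []  = []
```

With this shape, an input such as <img alt="a > b">text keeps only text, because the > inside the quoted alt value no longer ends the tag.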

M-1.3.2 — MEDIUM — walkDir has no symlink-loop protection. Line 406-416 recurses through _site via doesDirectoryExist, which follows symlinks. A developer who accidentally symlinks _site/a → _site will infinite-loop the build. Use doesDirectoryExist + pathIsSymbolicLink (in directory >= 1.3.6).

L-1.3.3 — LOW — isSafeUrl allows protocol-relative URLs. Line 161-164 accepts "/"-prefixed values. "//evil.com" matches this prefix. All current inputs are Hakyll-derived routes so the exposure is nil, but the comment ("Defense-in-depth URL allowlist") claims more rigor than the implementation provides. Fix: reject u that begins with //.
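
A minimal sketch of the stricter check follows. The name isSafeUrlStrict and the exact set of accepted schemes are assumptions for illustration; the real isSafeUrl in Stats.hs may accept a different list.

```haskell
import Data.List (isPrefixOf)

-- Hypothetical allowlist: site-relative paths plus http(s) URLs.
-- The schemes accepted by the real isSafeUrl may differ.
isSafeUrlStrict :: String -> Bool
isSafeUrlStrict u = isLocalPath || any (`isPrefixOf` u) ["https://", "http://"]
  where
    -- A single leading slash is a site-relative path. Two leading slashes
    -- start a protocol-relative URL; a slash followed by a backslash is
    -- rejected as well, since some URL parsers treat it the same way.
    isLocalPath = case u of
      '/' : '/'  : _ -> False
      '/' : '\\' : _ -> False
      '/' : _        -> True
      _              -> False
```

The case expression keeps the "one slash only" rule in a single place, so the accepted-prefix list does not need a special entry for "/".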

L-1.3.4 — LOW — readFile/Aeson.decodeStrict round-trip. Line 741 decodes backlinks via TE.encodeUtf8 (T.pack rawBL) where rawBL :: String. That is a String → Text → ByteString chain, so the data exists in three representations. Read the item as Item ByteString via getResourceLBS (which yields a lazy ByteString, so decode with Aeson.decode rather than decodeStrict), or keep backlinks.json as bytes throughout, to avoid two conversions.

L-1.3.5 — LOW — Two separate tag sections. renderStatsTags (line 380) and renderTagsSection (line 568) are the same function with different names. Consolidate.

L-1.3.6 — LOW — Lazy readFile in countLinesDir. Line 455: readFile (dir </> e) holds the handle open until length (lines content) is fully forced. Under forM, multiple handles may be concurrently open. For a 30-file build directory it's fine; use Data.Text.IO.readFile for explicit strictness.

NIT — lookupString "title" meta fallback "(untitled)" (line 71 and many siblings). Fine, but consider extracting a titleOr helper since it appears ~6 times.

1.4 Backlinks.hs

L-1.4.1 — LOW — normaliseUrl does not URL-decode. Line 188-194: stripping ? and # is done on the raw URL without percent-decoding. A path like /essays/caf%C3%A9 won't normalize to /essays/café. Current build likely does not emit percent-encoded routes, so this is latent.

L-1.4.2 — LOW — backlinksField does not handle the "item with noResult route" case explicitly. When getRoute item is Nothing, it fails with "backlinks: item has no route". Fine, but that path is unreachable for items that have an associated rule. Say so in a comment, or remove the branch if it can never be reached.

NIT — renderBacklinks concatenates strings; use blaze-html to match Stats.hs. Not urgent; the output is static per build.

1.5 Citations.hs

L-1.5.1 — LOW — Partial functions in transformInline. Line 142: head keys / head nums. Guarded by null nums check above and by the structure of Pandoc Cite (never empty from the parser), so this is safe in practice. Swap to case nums of (n:_) -> ....

L-1.5.2 — LOW — markerHtml concatenates T.unpack . show via tshow but also builds data-cite-keys as a space-separated list of HTML IDs with no escaping. If a citation key contains a quote character (unusual but legal), the attribute breaks.

NIT — stripRefPrefix (line 209) is "ref-"-specific; should be renamed stripPandocRefPrefix or documented with a pointer to the Pandoc source that emits it.

1.6 Compilers.hs

L-1.6.1 — LOW — pageCompiler does not save a toc snapshot. OK for pages that use pageCtx, but the commonplace, landing, and standalone pages that would benefit from a TOC get no opportunity. Not a bug — an architectural choice worth documenting.

NIT — stringify is redefined here (line 56-77) in addition to Filters/Images.hs:119-132 and the one Text.Pandoc.Shared exports. Three implementations. Pick one.

1.7 Stability.hs

M-1.7.1 — MEDIUM — readIgnore uses lazy readFile. Line 44: handle stays open until the whole list is forced. Fine for a single-shot read but the pattern is fragile; Data.Text.IO.readFile is strict.

L-1.7.2 — LOW — unsafeCompiler for git subprocess breaks Hakyll's dep tracking. stabilityField calls git log via unsafeCompiler. Hakyll will not re-run the compiler when HEAD moves. Expected — make build always runs git add content/ + commit first, which updates mtimes — but it's fragile to reason about. Worth a note at the unsafeCompiler call site rather than the header docs.

L-1.7.3 — LOW — gitDates ignores stderr. Line 54: (ec, out, _) <- readProcessWithExitCode ... binds stderr to _ and discards it. If the file isn't tracked yet, git prints a warning to stderr; user sees nothing. Log it.

NIT — stabilityFromDates classification is undocumented magic. n <= 5 && age < 90 → "revising". These thresholds should be constants with intent comments.

1.8 Catalog.hs

M-1.8.1 — MEDIUM — renderEntry does not escape frontmatter. ceTitle, ceYear, ceDuration, ceInstrumentation, and ceUrl are pasted directly into HTML via concat. This is consistent with the site's "author-controlled trusted HTML in titles" convention (Stats.hs:180-186 calls this out explicitly), but Catalog.hs has no such comment. If a collaborator's frontmatter contains a stray < or a malformed entry, the HTML breaks silently.

Suggest: adopt the pageLink convention from Stats.hs — escape href via safeHref, pass title through preEscapedToHtml with a documented comment.

L-1.8.2 — LOW — renderCategorySection assumes non-empty group. Line 194: categoryLabel (ceCategory (head g)). groupBy on a non-empty list produces non-empty sublists, so this is safe, but partial.

NIT — categoryRank uses lookup instead of elemIndex. Shorter:

categoryRank c = fromMaybe (length categoryOrder) (elemIndex c categoryOrder)

1.9 Commonplace.hs

H-1.9.1 — HIGH — Operator-precedence bug in renderChronoView (line 126-131).

renderChronoView entries =
    "<div class=\"cp-chrono\" id=\"cp-chrono\" hidden>"
    ++ if null sorted
        then "<p class=\"cp-empty\">No entries yet.</p>"
        else concatMap renderEntry sorted
    ++ "</div>"

Parses as "..." ++ (if null sorted then "..." else (concatMap renderEntry sorted ++ "</div>")). When sorted is empty, the closing </div> is silently dropped. Fix: parenthesize the if, or split into two lines with explicit binding.
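
The difference can be reproduced in isolation. In the sketch below, renderEntry is a stand-in (the real one lives in Commonplace.hs and is not reproduced here), and the wrapper markup is shortened; only the shape of the expression matters.

```haskell
import Data.List (isSuffixOf)

-- Stand-in for the real renderEntry; an assumption for illustration.
renderEntry :: String -> String
renderEntry e = "<p>" ++ e ++ "</p>"

-- Same shape as the reported code: the trailing ++ "</div>" is absorbed
-- into the else branch because if/then/else extends as far right as it can.
renderBuggy :: [String] -> String
renderBuggy sorted =
    "<div class=\"cp-chrono\">"
    ++ if null sorted
        then "<p class=\"cp-empty\">No entries yet.</p>"
        else concatMap renderEntry sorted
    ++ "</div>"

-- Parenthesizing the conditional makes the closing tag unconditional.
renderFixed :: [String] -> String
renderFixed sorted =
    "<div class=\"cp-chrono\">"
    ++ (if null sorted
          then "<p class=\"cp-empty\">No entries yet.</p>"
          else concatMap renderEntry sorted)
    ++ "</div>"

-- True when the rendered fragment ends with the closing wrapper tag.
closesDiv :: String -> Bool
closesDiv = ("</div>" `isSuffixOf`)
```

For a non-empty list both versions agree; only the empty case differs, which is why the bug is easy to miss in a populated commonplace book.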

L-1.9.2 — LOW — renderText replaces \n with <br>\n after escaping, which is correct, but does not escape \r. Windows-style line endings would produce \r<br>, leaving stray \r in HTML. Normalize line endings in stripTrailingNL.
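
A minimal sketch of the normalization step, assuming String input. The helper name normalizeNewlines is hypothetical; how it would be wired into stripTrailingNL depends on that function's body, which is not shown in this audit.

```haskell
-- Hypothetical helper: map CRLF and lone CR to LF before any
-- newline-to-<br> replacement runs.
normalizeNewlines :: String -> String
normalizeNewlines ('\r' : '\n' : rest) = '\n' : normalizeNewlines rest
normalizeNewlines ('\r' : rest)        = '\n' : normalizeNewlines rest
normalizeNewlines (c : rest)           = c : normalizeNewlines rest
normalizeNewlines []                   = []
```

Running this before escaping means the later \n → <br>\n step sees a single line-ending convention.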

1.10 Authors.hs

H-1.10.1 — HIGH — allContent omits directory-form essays. Line 50:

allContent = ("content/essays/*.md" .||. "content/blog/*.md") .&&. hasNoVersion

Compare to Tags.hs:69, which adds "content/essays/*/index.md". Any essay stored as content/essays/foo/index.md will NOT appear on its author's index page. This is the most likely source of silent "why isn't this essay on my author page" bugs.

L-1.10.2 — LOW — Duplicate of Contexts.authorSlugify. Authors.slugify and Contexts.authorSlugify do the same thing with different definitions (the Contexts version normalizes before filtering, Authors version filters after lowercasing). The two will diverge on Unicode edge cases. Consolidate.

1.11 Utils.hs

L-1.11.1 — LOW — wordCount counts HTML tokens as words. Called from Compilers.hs:172 on raw source src (Markdown, including any raw HTML) and from Stats.hs:809 on tag-stripped HTML. On raw Markdown this miscounts [display](url) as three "words". Low-severity because the stat is approximate anyway, but worth noting when comparing /stats/ numbers to wc.

No material issues. Metadata.hs is a two-line empty-module placeholder — delete or populate.


2. Pandoc filters (build/Filters/*.hs)

2.1 Filters/Images.hs — the big one

C-2.1.1 — CRITICAL — lowerExt returns the basename, not the extension. Line 110:

lowerExt = map toLower . reverse . ('.' :) . takeWhile (/= '.') . tail . dropWhile (/= '.') . reverse

Trace for "image.jpg":

  1. reverse → "gpj.egami"
  2. dropWhile (/= '.') → ".egami"
  3. tail → "egami"
  4. takeWhile (/= '.') → "egami"
  5. ('.' :) → ".egami"
  6. reverse → "image."
  7. map toLower → "image."

So lowerExt "image.jpg" == "image." — which does not equal .jpg, .jpeg, .png, or .gif. isLocalRaster is therefore False for every file, the entire <picture>/WebP dispatch is dead code, and tools/convert-images.sh produces .webp companions that are never referenced.

Fix: System.FilePath.takeExtension is already imported elsewhere and already pulled in transitively; replace with

lowerExt = map toLower . takeExtension
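
The two definitions can be compared side by side. In this sketch lowerExtBuggy is the definition quoted in the finding, lowerExtFixed is the proposed replacement, and isRasterExt is a hypothetical stand-in for the extension test inside isLocalRaster (the extension list is the one named in the finding).

```haskell
import Data.Char (toLower)
import System.FilePath (takeExtension)

-- The definition quoted in the finding, kept only for comparison.
-- It is partial: tail fails on a path that contains no '.'.
lowerExtBuggy :: String -> String
lowerExtBuggy = map toLower . reverse . ('.' :) . takeWhile (/= '.') . tail . dropWhile (/= '.') . reverse

-- Proposed replacement: takeExtension keeps the leading dot and
-- returns "" when there is no extension.
lowerExtFixed :: FilePath -> String
lowerExtFixed = map toLower . takeExtension

-- Hypothetical stand-in for the extension test in isLocalRaster.
isRasterExt :: String -> Bool
isRasterExt ext = ext `elem` [".jpg", ".jpeg", ".png", ".gif"]
```

In GHCi, isRasterExt (lowerExtBuggy "image.jpg") is False while isRasterExt (lowerExtFixed "image.JPG") is True, which matches the trace above.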

M-2.1.2 — MEDIUM — passedKvs duplicate-emits id, class, alt, title. Line 77:

passedKvs = filter (\(k, _) -> k `notElem` ["loading", "data-lightbox"]) kvs

But above, attrId, attrClasses, attrAlt, and attrTitle already emit those attributes from (ident, classes, kvs). If an author writes ![alt](img.jpg){.foo title="bar"}, Pandoc places title into kvs, so the output becomes <img ... class="foo" title="bar" title="bar">. Expand the blacklist:

passedKvs = filter (\(k, _) -> k `notElem` ["loading", "data-lightbox", "id", "class", "alt", "title"]) kvs

Side-note: the same issue affects the non-picture branch at line 47 indirectly (via the Image constructor Pandoc emits), but Pandoc's HTML writer handles dedup there.

M-2.1.3 — MEDIUM — stringify catches most but not all inline variants. Line 119-132: handles Str, Space, SoftBreak, LineBreak, Emph, Strong, Code, Link, Image, Span. Misses Strikeout, Superscript, Subscript, SmallCaps, Quoted, Cite, Math, RawInline. Alt text for an image captioned ~subscript~ will be empty.

L-2.1.4 — LOW — renderKvs does not escape the key. Line 94: " " <> k <> "=\"" <> esc v <> "\"". Keys in Pandoc come from Markdown attribute syntax and can only be identifiers, so this is safe in practice; but it's asymmetric with v and deserves either esc k or an assertion comment.

L-2.1.5 — LOW — isUrl does not recognize data:, file://, or mailto: URLs. None of these matter for image sources in this filter, so the check is accurate for its intended domain.

2.2 Filters/Transclusion.hs

M-2.2.1 — MEDIUM — sec attribute not HTML-escaped. Line 41:

Just (slugToUrl slug, " data-section=\"" ++ sec ++ "\"")

sec is everything after # up to }} in the Markdown source. If an author writes {{essay#a"b}}, the emitted HTML is <div … data-section="a"b"> — invalid markup. Not a realistic XSS vector on a single-author static site (would be a self-attack), but:

  • It's an injection primitive. The moment content ever comes from a PR, a collaborator, or an imported source, it becomes one.
  • The fix is one line: escape ", <, >, & before interpolation.
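
A minimal sketch of that escape, written against String to match the ++ in the quoted line. Both escAttr and sectionAttr are hypothetical names; in practice one of the existing escHtml copies could serve instead.

```haskell
-- Escape the four characters that can break a double-quoted attribute
-- value or open markup. Working per character means & is never
-- double-escaped.
escAttr :: String -> String
escAttr = concatMap esc
  where
    esc '&' = "&amp;"
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc '"' = "&quot;"
    esc c   = [c]

-- Hypothetical helper mirroring the attribute built at Transclusion.hs:41.
sectionAttr :: String -> String
sectionAttr sec = " data-section=\"" ++ escAttr sec ++ "\""
```

With this in place, a section name such as a"b is emitted as data-section="a&quot;b" and the surrounding markup stays well-formed.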

L-2.2.2 — LOW — slugToUrl appends .html unconditionally. Line 46-49: slug ++ ".html". If the slug is already page.html, you get page.html.html. Unlikely in practice (source convention is {{essay-slug}} with no extension), but guard against it.

NIT — trim re-implemented yet again. Same function appears at least four times (Transclusion.hs:59, EmbedPdf.hs:80, Wikilinks.hs:59, plus Contexts.hs's strip). Factor.

2.3 Filters/Score.hs

H-2.3.1 — HIGH — TIO.readFile fullPath with no existence check and no exception handling. Line 40. A Markdown file that references a missing SVG aborts the entire Hakyll build with nothing more than:

openFile: does not exist (No such file or directory)

No filename, no page context, no recovery. Fix:

existed <- doesFileExist fullPath
if not existed
  then do putStrLn $ "[Score] missing: " ++ fullPath
          return (Div ("", cls, attrs) blocks)
  else do svgRaw <- TIO.readFile fullPath
          ...

Or wrap in try and fall back to an errorBlock mirroring Filters.Viz.errorBlock.
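
The try-based variant might look like the sketch below. readSvgSafe is a hypothetical helper name, and turning the Left case into an error block (as Filters.Viz.errorBlock does) is left to the caller.

```haskell
import Control.Exception (IOException, try)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Hypothetical wrapper: read a file strictly and turn any IOException
-- into a message that names the path being read.
readSvgSafe :: FilePath -> IO (Either String T.Text)
readSvgSafe path = do
  result <- try (TIO.readFile path) :: IO (Either IOException T.Text)
  pure $ case result of
    Left err  -> Left ("[Score] cannot read " ++ path ++ ": " ++ show err)
    Right svg -> Right svg
```

Compared with a doesFileExist check, try also covers permission errors and the window between the existence check and the read.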

M-2.3.2 — MEDIUM — Lazy-I/O readFile under walkM. Retracted: Data.Text.IO.readFile reads the whole file strictly, so there is no lazy-I/O hazard here. The real issue is H-2.3.1 above.

L-2.3.3 — LOW — processColors is order-sensitive. The comment on line 56-58 acknowledges it: the 6-digit hex replacements come last in the function composition chain, which means they're applied first. That's correct and the comment is helpful. Keep the comment.

L-2.3.4 — LOW — escHtml reorder bug. Line 88-92:

escHtml = T.replace "\"" "&quot;"
        . T.replace ">"  "&gt;"
        . T.replace "<"  "&lt;"
        . T.replace "&"  "&amp;"

& must be replaced first; otherwise the & in the &lt;, &gt;, and &quot; entities produced by the other replacements would itself be rewritten, yielding &amp;lt; and so on. Function composition runs right to left: f . g . h applied to x is f (g (h x)), so the replacements execute bottom-up as &, then <, then >, then ". That order is correct, so this finding is retracted. Viz.escHtml at Viz.hs:178-182 has the same composition order and is also correct. Nit only: write the function as a single chain with a comment stating the invariant.

2.4 Filters/Viz.hs

H-2.4.1 — HIGH — No file-existence check before readProcessWithExitCode. Line 96-99. Same class of bug as Score; the user sees "non-zero exit" with no path. Add doesFileExist fullPath before spawning.

M-2.4.2 — MEDIUM — Exception handler drops the exception detail. Line 99:

`catch` (\e -> return (ExitFailure 1, "", show (e :: IOException)))

The third tuple element is set to show e, and the caller on line 102 reads it as err and displays it, so the exception detail is preserved. Retracted. The error then reaches errorBlock, which renders <div class="viz-error">...</div> inline in the page. That is a graceful failure mode.

L-2.4.3 — LOW — escScriptTag only replaces </. Line 133: correct for JSON embedding but not for content that contains <!-- or ]]> inside strings. Vega-Lite specs won't contain those, so fine.

L-2.4.4 — LOW — warn uses putStrLn to stdout, not stderr. Line 176. Mixes with Hakyll's build progress output. Use hPutStrLn stderr.

2.5 Filters/Sidenotes.hs

H-2.5.1 — HIGH — Label wrap at 26 produces duplicate IDs. Line 38:

toLabel n = T.singleton (toEnum (fromEnum 'a' + (n - 1) `mod` 26))

Note 27 → a again. Two <sup id="snref-a"> and two <sup id="sn-a"> in the same document. Duplicate IDs are invalid HTML, break href="#sn-a" fragment navigation, and confuse ATs.

Fix options:

  1. Use numeric labels: "sn" ++ show n.
  2. Use two-letter labels for n > 26: aa, ab, …, zz.
  3. Fail loudly with error: essays with >26 footnotes are rare and the user should know.
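
Option 2 generalizes naturally to spreadsheet-style labels (a … z, aa … zz, aaa, …), which never repeat. The sketch below uses String rather than Text for brevity, and snId is a hypothetical helper showing how the label would feed the id; the real toLabel returns Text.

```haskell
-- Bijective base-26 labels: 1 -> "a", 26 -> "z", 27 -> "aa", 702 -> "zz".
-- Every positive index gets a distinct label, so ids cannot collide.
toLabel :: Int -> String
toLabel n
  | n <= 0    = error "toLabel: expects a positive note index"
  | otherwise = reverse (go n)
  where
    go :: Int -> String
    go 0 = ""
    go k =
      let (q, r) = (k - 1) `divMod` 26
      in  toEnum (fromEnum 'a' + r) : go q

-- Hypothetical helper: the element id for the n-th sidenote.
snId :: Int -> String
snId n = "sn-" ++ toLabel n
```

The first 26 labels are unchanged from the current scheme, so existing fragment links into documents with 26 or fewer notes keep working.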

M-2.5.2 — MEDIUM — replacePTags is a string-level hack. Line 57-60:

replacePTags =
    T.replace "<p>" "<span class=\"sidenote-para\">"
    . T.replace "</p>" "</span>"

A footnote whose content contains the literal text <p> (e.g., a code sample discussing HTML) will be mangled. Rare but possible. The correct fix is to transform the AST before writing, not the post-rendered HTML.

2.6 Filters/Links.hs

M-2.6.1 — MEDIUM — isExternal uses substring match for the site domain. Line 59:

isExternal url =
    ("http://"  `T.isPrefixOf` url || "https://" `T.isPrefixOf` url)
    && not ("levineuwirth.org" `T.isInfixOf` url)

https://evil-levineuwirth.org.attacker.com/phish contains levineuwirth.org as a substring → classified as internal → no rel=noopener noreferrer target=_blank. In 2026 with partitioned cookies this is mostly a cosmetic concern, but fix is trivial:

isSameHost url =
    -- (<|>) comes from Control.Applicative
    case T.stripPrefix "https://" url <|> T.stripPrefix "http://" url of
        Nothing    -> False
        Just rest  ->
            -- the host ends at the first '/', ':', '?' or '#'
            let host = T.toLower (T.takeWhile (`notElem` ['/', ':', '?', '#']) rest)
            in  host == "levineuwirth.org"
                || ".levineuwirth.org" `T.isSuffixOf` host

isExternal then becomes the existing http(s) prefix test combined with not (isSameHost url).

M-2.6.2 — MEDIUM — PDF links with fragment are not rewritten. Line 30-36 requires ".pdf" `T.isSuffixOf` url. A URL like /papers/foo.pdf#page=5 ends in 5, not .pdf, so it doesn't route through the PDF.js viewer. Compare to EmbedPdf.hs, which does handle fragments in the source preprocessor path. Inconsistent.
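
One way to make the suffix test fragment-aware is to split off any query or fragment first. This is a sketch on String with hypothetical names (splitSuffix, isPdfLink); the case-insensitive comparison is an added assumption, not something the current filter is known to do.

```haskell
import Data.Char (toLower)
import Data.List (isSuffixOf)

-- Split a URL into the part before the first '?' or '#', and the rest.
splitSuffix :: String -> (String, String)
splitSuffix = break (`elem` "?#")

-- Test the extension on the path part only, ignoring case.
isPdfLink :: String -> Bool
isPdfLink url = ".pdf" `isSuffixOf` map toLower (fst (splitSuffix url))
```

Keeping the second component of splitSuffix around lets the rewritten viewer URL carry the original #page=5 fragment through.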

L-2.6.3 — LOW — domainIcon duplicates twitter/x and youtube/youtu.be mappings. Fine. Nit: table-driven via lookup would be cleaner than the chain of guards.

2.7 Filters/Wikilinks.hs

M-2.7.1 — MEDIUM — toMarkdownLink does not escape ] or ). Line 33-36:

toMarkdownLink inner =
    let (title, display) = splitOnPipe inner
        url              = "/" ++ slugify title
    in "[" ++ display ++ "](" ++ url ++ ")"

If the display text contains ] or ), the generated Markdown is broken and Pandoc will parse it as raw text or as a weird link. Rare in practice (wikilink display is usually a plain name), but worth escaping.
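
A minimal sketch of the escaping, assuming Pandoc's default backslash escapes for punctuation in Markdown. escapeLinkText and toMarkdownLinkSafe are hypothetical names, and the slug argument is assumed to be already slugified; a ) is only a problem in the URL part, so it is not escaped in the link text here.

```haskell
-- Backslash-escape the characters that can end or confuse link text.
escapeLinkText :: String -> String
escapeLinkText = concatMap esc
  where
    esc c
      | c `elem` "[]\\" = ['\\', c]
      | otherwise       = [c]

-- Hypothetical variant of toMarkdownLink; slug is assumed to be the
-- output of slugify and therefore free of ')' and whitespace.
toMarkdownLinkSafe :: String -> String -> String
toMarkdownLinkSafe slug display =
    "[" ++ escapeLinkText display ++ "](/" ++ slug ++ ")"
```

Escaping the backslash itself keeps a display text that already contains \] from turning into an unescaped bracket.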

L-2.7.2 — LOW — slugify uses intercalate "-" . words . ..., so "a.b" → "a b" → "a-b". That's by design (punctuation becomes space becomes hyphen). Inputs like "end." do not pick up a trailing hyphen: the "." becomes a space, words drops it, and the result is "end". OK.

NIT — Inefficient trim: reverse . dropWhile ' ' . reverse . dropWhile ' '. T.strip would do this if the inputs were Text; with the String-based pipeline the double reverse is unavoidable.

2.8 Filters/EmbedPdf.hs

M-2.8.1 — MEDIUM — encodeQueryValue does not encode #. Line 68-76: the encoder is called on filePath, which is already split on # by parseDirective (line 38). So the unencoded # issue doesn't bite here. However, the docstring at line 65 says "percent-encode characters that would break a query-string value" — # is such a character. Add it for defense in depth, even if the current call site doesn't benefit.

L-2.8.2 — LOW — parsePageHash silently produces "" for invalid fragments. Line 45-51. An author writing {{pdf:/foo.pdf#garbage}} silently drops the fragment. No warning.

2.9 Filters/Typography.hs, Filters/Code.hs, Filters/Smallcaps.hs, Filters/Dropcaps.hs, Filters/Math.hs

Scanned via the parallel sub-audit; only nit-level findings apply (duplicate escHtml, smart-quote edge case in abbreviation matching, apply = id placeholders).


3. Static JavaScript (static/js/*.js)

Audited by parallel exploration. The full per-file list is long; the aggregate pattern is: no user-authored content is ever injected, so innerHTML usage across popups.js, annotations.js, citations.js, and selection-popup.js is not an XSS vector under the current authoring model. The risk profile changes the moment the site accepts PRs, gains an annotations-backend, or proxies third-party content (none of which are planned per spec.md).

3.1 XSS surface (all author-trust scoped)

M-3.1.1 — MEDIUM — popups.js:608-614 copies innerHTML from the page into the popup. The epistemicContent provider does html += '<div class="ep-compact">' + compact.innerHTML + '</div>'. Because the source (.ep-compact) is emitted by our own Haskell code (Contexts.hs + templates), this is safe under the trust model. Switch to compact.cloneNode(true) + popup.appendChild() for a defense-in-depth fix that costs nothing.

M-3.1.2 — MEDIUM — popups.js cross-origin fetches (Wikipedia, arXiv, CrossRef, GitHub, etc.) don't validate Content-Type. A malicious CORS-enabled endpoint could return HTML that the popup would render. Every fetch already pipes through an esc() call (line 655-661), so the risk is bounded to text that escapes in some corner.

L-3.1.3 — LOW — citations.js:15, 56 and annotations.js:167-172 use innerHTML with escaped data. The escaping is correct; the fragility is that the escape-before-concat pattern is easy to get wrong in the future.

3.2 Event handling / lifecycles

M-3.2.1 — MEDIUM — sidenotes.js:73-94 attaches listeners per-sidenote with no cleanup path. When transclude.js re-renders a fragment on resize, sidenotes accumulate duplicate handlers. Net effect: update() gets called 2×, 3×, … on hover over the same sidenote. Not a bug in the output, but a measurable leak over a long session.

M-3.2.2 — MEDIUM — popups.js attaches listeners at load time and never re-binds for transcluded content. A transcluded essay's internal links have no popup previews. If transclusion is meant to feel "live", this is a user-visible gap.

M-3.2.3 — MEDIUM — semantic-search.js:66-74 race in loadModel. If two searches fire before the first model-load resolves, both call import() and pipeline(). Second call wastes CPU + memory. Track in-flight Promise:

if (loadPromise) return loadPromise;
loadPromise = import(CDN).then(...);

3.3 Accessibility

H-3.3.1 — HIGH — gallery.js overlay has no focus trap. openOverlay() focuses the close button, but Tab escapes into the backdrop. Pattern to copy: settings.js:35-49.

M-3.3.2 — MEDIUM — selection-popup.js annotation picker color swatches are mouse-only. Arrow-key navigation + Enter to select would make it keyboard-accessible.

M-3.3.3 — MEDIUM — sidenotes.js sidenote focus toggle is click-only. No keyboard equivalent.

L-3.3.4 — LOW — lightbox.js:18,42 defaults img.alt to "" and only later populates from source. If source alt is missing, the lightbox image has no accessible name. Use img.alt = srcAlt || 'Lightbox image'.

L-3.3.5 — LOW — theme.js:9-28 does not try/catch around localStorage.getItem. Private-browsing Safari throws. The code happens to work because getItem returns null on failure in most browsers, but not all.

3.4 Duplication and style

L-3.4.1 — LOW — HTML escaping reimplemented 3× across annotations.js, popups.js, semantic-search.js. Add a shared utils.js (one function).

L-3.4.2 — LOW — Mixed var vs const/let. citations.js, nav.js, sidenotes.js, toc.js use modern ES6+; popups.js, annotations.js, gallery.js use var. Pick one.

NIT — Magic-number sprinkles for delays (SHOW_DELAY=250, HIDE_DELAY=150, SHOW_DELAY=450, swipe threshold 30, etc.). Not worth a refactor.


4. CSS and HTML templates

Audited by parallel exploration. Highlights:

4.1 CSS

H-4.1.1 — HIGH — Undefined CSS custom properties. build.css uses --rule (lines 21, 30, 39, 69), components.css:1448 uses --bg-subtle, and --font-ui appears in many places; none of the three is defined in base.css. A declaration whose var() references an undefined property with no fallback is invalid at computed-value time, so the property falls back to its inherited or initial value → silent visual degradation on the /build/ and annotation-related pages.

Fix:

:root {
    --rule:      var(--border-muted);
    --font-ui:   var(--font-sans);
    --bg-subtle: #f5f5f5;
}
[data-theme="dark"] { --bg-subtle: #1f1f1f; }

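H-4.1.2 — HIGH — Dark-mode --text-faint contrast fails WCAG AA. #6a6660 on #121212 is roughly 3.3:1 by the WCAG 2.x relative-luminance formula, below the 4.5:1 AA threshold for normal-size text. Used for sidenote numbers (0.65em!) and disabled-state icons. Bump to ~#8b8680 (roughly 5.2:1) at minimum.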
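
Contrast figures for any colour pair can be re-checked with the WCAG 2.x relative-luminance formula. The sketch below is a standalone calculator, not part of the site's build; all names are hypothetical and it accepts only six-digit #rrggbb values.

```haskell
import Numeric (readHex)

-- Parse two hex digits into 0..255.
hexByte :: String -> Maybe Int
hexByte s = case readHex s of
  [(v, "")] -> Just v
  _         -> Nothing

-- Parse "#rrggbb" into its three channels.
parseHex :: String -> Maybe (Int, Int, Int)
parseHex ['#', r1, r2, g1, g2, b1, b2] =
  (,,) <$> hexByte [r1, r2] <*> hexByte [g1, g2] <*> hexByte [b1, b2]
parseHex _ = Nothing

-- Linearize one sRGB channel (threshold as written in WCAG 2.x).
linearize :: Int -> Double
linearize c
  | s <= 0.03928 = s / 12.92
  | otherwise    = ((s + 0.055) / 1.055) ** 2.4
  where
    s = fromIntegral c / 255

-- Relative luminance of an sRGB colour.
luminance :: (Int, Int, Int) -> Double
luminance (r, g, b) =
  0.2126 * linearize r + 0.7152 * linearize g + 0.0722 * linearize b

-- Contrast ratio (lighter + 0.05) / (darker + 0.05); Nothing on bad input.
contrastRatio :: String -> String -> Maybe Double
contrastRatio fg bg = do
  a <- luminance <$> parseHex fg
  b <- luminance <$> parseHex bg
  pure ((max a b + 0.05) / (min a b + 0.05))
```

Calling contrastRatio with the --text-faint value and the dark background, then comparing the result against 4.5 (AA, normal text) or 3.0 (AA, large text), gives a quick check before committing a new token value.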

H-4.1.3 — HIGH — TOC collapse hides content from keyboard + AT. components.css:433-436 uses visibility: hidden on collapsed TOC, which removes it from the accessibility tree. Use aria-expanded + height transition, or aria-hidden="true" explicitly, or display: none (losing the smooth collapse).

H-4.1.4 — HIGH — No consistent :focus-visible ring across interactive elements. .nav-portal-toggle, .settings-toggle, .toc-toggle, .annotation-toggle lack focus styles. Add a global:

button:focus-visible, a:focus-visible {
    outline: 2px solid var(--text);
    outline-offset: 2px;
}

M-4.1.5 — MEDIUM — Hardcoded hex in print.css. #fff, #000, #f9f9f9, #ddd bypass variables. Move into a @media print :root overrides block.

M-4.1.6 — MEDIUM — Breakpoints are scattered. 540px, 680px, 900px, 1100px, 1500px appear across files with no central definition. Define once in base.css:

:root {
    --bp-phone: 540px;
    --bp-tablet: 680px;
    --bp-desktop: 900px;
    --bp-wide: 1500px;
}

(Note: CSS variables cannot be used inside @media queries; use Sass or a preprocessor, or settle for a comment + grep discipline.)
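
If the "comment + grep discipline" route is chosen, the grep can be scripted. This is a sketch under assumptions: the helper names are hypothetical, the agreed set is simply the five values listed in the finding, and only classic min-width/max-width conditions in px are recognised (range syntax such as width >= 680px is not).

```python
import re
from collections import Counter

# Everything between "@media" and the opening brace of the block.
MEDIA_PRELUDE = re.compile(r"@media([^{]*)\{")
# Pixel widths in classic conditions such as "(max-width: 680px)".
WIDTH = re.compile(r"(?:min|max)-width:\s*(\d+)px")


def breakpoint_counts(css: str) -> Counter:
    """Count how often each pixel breakpoint appears in @media preludes."""
    counts = Counter()
    for prelude in MEDIA_PRELUDE.findall(css):
        counts.update(int(px) for px in WIDTH.findall(prelude))
    return counts


def unexpected_breakpoints(css: str, allowed: set) -> list:
    """Return breakpoints used in the CSS that are not in the agreed set."""
    return sorted(set(breakpoint_counts(css)) - allowed)


# Assumption: the agreed set is the five values named in the finding.
AGREED = {540, 680, 900, 1100, 1500}
sample = """
@media (max-width: 680px) { .toc { display: none; } }
@media (min-width: 680px) and (max-width: 900px) { .sidenote { float: none; } }
@media (max-width: 720px) { .gallery { columns: 1; } }
"""
print(unexpected_breakpoints(sample, AGREED))  # prints [720]
```

Run against the concatenated stylesheets, an empty result means every media query uses an agreed value.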

L-4.1.7 — LOW — Inconsistent transition timings. 0.15s, 0.28s, 0.3s, 0.35s, 0.5s scattered. Three tokens would cover all cases.

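L-4.1.8 — LOW — font-variant shorthand resets sibling properties. reading.css:95 and library.css:22 use font-variant: small-caps. The shorthand is not deprecated, but it resets every other font-variant-* longhand (ligatures, numeric, and so on) to its initial value. Use font-variant-caps: small-caps to set only the caps behavior.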

4.2 HTML templates

M-4.2.1 — MEDIUM — templates/default.html:30-33 inline onload script. The KaTeX bootstrap is an inline onload attribute containing a multi-line JS expression. It works today, but any future strict CSP (one that omits 'unsafe-inline') would block it. Move to an external katex-bootstrap.js served from /js/.

L-4.2.2 — LOW — templates/partials/nav.html buttons lack type="button". If the nav is ever placed inside a <form>, these buttons default to type="submit" and activating one will submit the form. Belt-and-suspenders fix: add type="button" to every <button> that isn't a submit.

L-4.2.3 — LOW — templates/partials/head.html loads all component CSS unconditionally plus three conditional files. Not a perf bug on HTTP/2, but components.css (1464 lines) is loaded even on the homepage. Split.


5. Python tooling (tools/*.py)

5.1 tools/embed.py

H-5.1.1 — HIGH — Root URL becomes "/./". Lines 68-73:

def _url_from_path(html_path: Path) -> str:
    rel = html_path.relative_to(SITE_DIR)
    if rel.name == "index.html":
        url = "/" + str(rel.parent) + "/"
        return url.replace("//", "/")
    return "/" + str(rel)

For _site/index.html, rel.parent is Path("."). str(Path(".")) is "." on Linux. Result: "/./". Haskell's SimilarLinks.normaliseUrl produces "/" for the same route, so lookup fails and the homepage never gets similar-links suggestions.

Fix:

if rel.name == "index.html":
    parent = str(rel.parent)
    if parent in (".", ""):
        return "/"
    return "/" + parent + "/"
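
For reference, here is the fix in the shape of a complete function. This is a sketch, not the file's actual code: it takes the site directory as a parameter instead of reading the module-level SITE_DIR so that it runs on its own, and it uses as_posix() so the result does not depend on the OS path separator.

```python
from pathlib import Path


def url_from_path(html_path: Path, site_dir: Path) -> str:
    """Map a rendered HTML file to its site-relative URL."""
    rel = html_path.relative_to(site_dir)
    if rel.name == "index.html":
        parent = rel.parent.as_posix()
        # The root index has parent "."; emit "/" rather than "/./".
        return "/" if parent == "." else f"/{parent}/"
    return "/" + rel.as_posix()


site = Path("_site")
for page in ("index.html", "essays/foo/index.html", "about.html"):
    print(page, "->", url_from_path(site / page, site))
```

With the special case in place, the replace("//", "/") cleanup in the original is no longer needed.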

L-5.1.2 — LOW — No --quiet mode. embed.py prints progress unconditionally; CI builds get noise.

L-5.1.3 — LOW — needs_update() uses rglob("*.html") over _site. Fine, but for a large _site/ this re-stats every HTML file on every build. Could be cached via a single watermark file.

NIT — EXCLUDE_URLS comparison against /search/, /build/, etc. works only because _url_from_path matches those exact forms. A refactor could break the set. Document.

5.2 tools/import-poetry.py

H-5.2.1 — HIGH — yaml_str() does not escape newlines. Lines 193-203. An abstract, attribution, or first-line containing \n yields invalid YAML. Add \n, \r to the needs_quote character set.
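
An alternative to extending the needs_quote set is to always emit a double-quoted scalar. The sketch below swaps in a different technique from the one the finding proposes: it leans on json.dumps, whose string escapes (\n, \r, \t, \", \\, \uXXXX) are also valid inside YAML double-quoted scalars. The name yaml_quoted is hypothetical and the real yaml_str() may need to keep its unquoted fast path.

```python
import json


def yaml_quoted(value: str) -> str:
    """Render a string as a single-line, double-quoted YAML scalar.

    ensure_ascii=False leaves non-ASCII text (accents, curly quotes) readable
    instead of turning it into \\uXXXX escapes.
    """
    return json.dumps(value, ensure_ascii=False)


abstract = 'First line\nSecond line with "quotes"'
print(f"abstract: {yaml_quoted(abstract)}")
```

The trade-off is cosmetic: every value ends up quoted, even ones that would be safe bare.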

H-5.2.2 — HIGH — Empty title_prefix / collection_slug silently collide. Line 328. If --collection is all punctuation, the slug becomes empty and every poem writes to the same path. Add an up-front assertion:

if not collection_slug or collection_slug == "-":
    sys.exit(f"error: collection slug is empty (check --collection={args.collection!r})")

M-5.2.3 — MEDIUM — --date is unvalidated. Line 313. User can pass --date "last tuesday" and it flows into YAML unchanged. Parse as int in [1, 2100].
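
argparse can do this validation at parse time through a custom type function. A minimal sketch, assuming --date is meant to hold a year as the [1, 2100] range suggests; the function name year is illustrative.

```python
import argparse


def year(value: str) -> int:
    """argparse type: accept an integer year between 1 and 2100 inclusive."""
    try:
        parsed = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"not an integer year: {value!r}") from None
    if not 1 <= parsed <= 2100:
        raise argparse.ArgumentTypeError(f"year out of range 1..2100: {parsed}")
    return parsed


parser = argparse.ArgumentParser()
parser.add_argument("--date", type=year)

print(parser.parse_args(["--date", "1855"]).date)
```

With this in place, --date "last tuesday" stops at the command line with a usage error instead of flowing into the generated frontmatter.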

M-5.2.4 — MEDIUM — roman_to_int has no explicit bounds check. Lines 45-52. It is guarded by a regex at the call site, which is fine today, but the function should validate its own input so that a new caller cannot bypass the guard.
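
One defensive shape for such a function is sketched below. The body of the real roman_to_int is not shown in this audit, so this is an independent illustration: it validates against the canonical numeral grammar (1 to 3999) before converting, and raises ValueError on anything else.

```python
import re

_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
# Canonical numerals only: rejects "IIII", "VX", and non-Roman text.
_CANONICAL = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")


def roman_to_int(numeral: str) -> int:
    """Convert a canonical Roman numeral to an int, or raise ValueError."""
    text = numeral.strip().upper()
    # The pattern also matches the empty string, so check for that first.
    if not text or not _CANONICAL.fullmatch(text):
        raise ValueError(f"not a valid Roman numeral: {numeral!r}")
    total = 0
    # Pair each symbol with its successor; the trailing space acts as a sentinel.
    for current, following in zip(text, text[1:] + " "):
        value = _VALUES[current]
        # Subtractive pair (IV, XC, ...): a smaller value before a larger one.
        total += -value if _VALUES.get(following, 0) > value else value
    return total


print(roman_to_int("XIV"), roman_to_int("MCMXCIV"))  # prints 14 1994
```

Validating inside the function keeps the call-site regex as a convenience rather than a safety requirement.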

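L-5.2.5 — LOW — write_text(content, encoding="utf-8") with no errors= argument. The default is strict, so content that UTF-8 cannot encode (in practice, lone surrogates) raises UnicodeEncodeError. Pass errors="strict" or errors="replace" explicitly so the choice is intentional.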
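
The difference between the two handlers is easy to see with a lone surrogate, which is the one kind of code point UTF-8 refuses to encode. A short sketch; write_poem is a hypothetical wrapper, not a function from the tool.

```python
import tempfile
from pathlib import Path


def write_poem(path: Path, content: str, errors: str = "strict") -> None:
    """Write text as UTF-8 with an explicitly chosen error handler."""
    path.write_text(content, encoding="utf-8", errors=errors)


damaged = "a\udc80b"  # lone surrogate, e.g. left behind by a bad decode upstream

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "poem.md"
    try:
        write_poem(target, damaged)  # strict: refuses to write
    except UnicodeEncodeError as exc:
        print("strict:", exc.reason)
    write_poem(target, damaged, errors="replace")  # replace: writes "?" instead
    print("replace:", target.read_bytes())
```

Strict fails loudly and is probably the better default for an import tool; replace silently alters the text, which matters for poetry.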

5.3 tools/viz_theme.py

L-5.3.1 — LOW — save_svg has no try/finally around plt.close(fig). If savefig raises, the figure is never closed and matplotlib state leaks. It is a standalone CLI tool, so the impact is one figure.

5.4 pyproject.toml / uv.lock

M-5.4.1 — MEDIUM — No upper bounds on dependencies. torch>=2.5, sentence-transformers>=3.4, faiss-cpu>=1.9, numpy>=2.0. A future major release can break the build silently. Cap each at its next major version (for example sentence-transformers>=3.4,<4).


6. Shell scripts and Makefile

6.1 Makefile

M-6.1.1 — MEDIUM — make deploy pushes to GitHub before rsync. Lines 57-58:

deploy: clean build sign
    git push -u origin main
    rsync -avz --delete _site/ $(VPS_USER)@$(VPS_HOST):$(VPS_PATH)/

If rsync fails, the GitHub push has already succeeded, leaving the remote ahead of the deployed site. The reverse order (rsync first, then push on success) would be safer, though make won't auto-rollback either way.

M-6.1.2 — MEDIUM — deploy uses $(VPS_USER), $(VPS_HOST), $(VPS_PATH) with no definition in the Makefile. They must come from .env. If any is unset, the rsync destination is malformed (with all three unset it expands to @:/) → the command either opens an SSH connection to the wrong place or errors out obliquely. Add a guard:

deploy: clean build sign
    @test -n "$(VPS_USER)" || (echo "VPS_USER not set" >&2; exit 1)
    @test -n "$(VPS_HOST)" || (echo "VPS_HOST not set" >&2; exit 1)
    @test -n "$(VPS_PATH)" || (echo "VPS_PATH not set" >&2; exit 1)
    ...

M-6.1.3 — MEDIUM — build commits content before building but never cleans up on failure. Lines 9-10:

@git add content/
@git diff --cached --quiet || git commit -m "auto: $$(date -u +%Y-%m-%dT%H:%M:%SZ)"

If the subsequent cabal run site -- build fails, the commit is already in place. A retried make build sees a clean index and skips the commit step, so the content that broke the build stays in history as an ordinary auto-commit with nothing marking the failure. Low severity: the memory note about always running make clean && make build, plus the deploy: clean build target, make this infrequent.

L-6.1.4 — LOW — clean only runs cabal run site -- clean. That cleans _site and _cache (Hakyll's store), but not dist-newstyle/ (cabal build output) or the embeddings under data/ (gitignored but stale). Arguably correct-as-designed: deep clean is git clean -fdX. Document.

L-6.1.5 — LOW (retracted) — pdf-thumbs recipe was flagged for an unquoted $$pdf in the -nt test, which would misparse a PDF filename containing a space. On re-reading, the recipe already quotes the path:

if [ ! -f "$${thumb}.png" ] || [ "$$pdf" -nt "$${thumb}.png" ]; then

False alarm; no change needed.

L-6.1.6 — LOW — export on line 6 is blunt. Every Make variable and every variable inherited from the shell becomes available to every recipe. For a solo build this is fine; be aware of scope creep.

NIT — build-start.txt via file I/O instead of a make variable. Lines 11 and 23. A single recipe could use shell arithmetic, avoiding the scratch file (and the need to gitignore it).

6.2 tools/sign-site.sh

L-6.2.1 — LOW — find | xargs -I{} with -P $(nproc) has a non-obvious parallelism edge case. The -I{} substitution plus -0 is safe for spaces and other odd filenames, but if $(nproc) returns 0 (cgroup edge cases), -P 0 means "as many as possible", which is arguably fine but non-obvious. Make it explicit: -P "${JOBS:-$(nproc)}".

NIT — Hardcoded key fingerprint C9A42A6FAD444FBE566FD738531BDC1CC2707066. Expected, but document that a key rotation requires editing both this script and preset-signing-passphrase.sh.

6.3 tools/convert-images.sh

No issues of note. set -euo pipefail, find -print0 | read -d '', quoting are all correct.

6.4 tools/download-model.sh

L-6.4.1 — LOW — No checksum verification on downloaded ONNX. Line 26 curl -fsSL $BASE_URL/$src -o $dst. If HuggingFace is compromised or returns a different build of the model, the site ships a trojan without warning. Pin expected SHA-256 and verify.
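
The check itself is small. The download script is shell, where sha256sum -c would be the natural tool; the sketch below shows the same logic in Python for illustration. The function names are hypothetical, and the expected digest would have to be recorded from a download that is known to be good.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large models fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_sha256: str) -> None:
    """Delete the file and raise if its digest does not match the pinned one."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        path.unlink()
        raise ValueError(
            f"{path.name}: SHA-256 mismatch (expected {expected_sha256}, got {actual})"
        )
```

Removing the file on mismatch keeps a later build from picking up an unverified model that happens to be sitting in the right place.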

NIT — Hardcoded HuggingFace URL. Document that an official mirror is unavailable; that's why you pull from resolve/main rather than a pinned revision.

6.5 tools/subset-fonts.sh

L-6.5.1 — LOW — Paths are Arch-specific. /usr/share/fonts/ttf-spectral, /usr/share/fonts/TTF. On Debian/Ubuntu, JetBrains Mono lives at /usr/share/fonts/truetype/jetbrains-mono/. Detect via fc-match or document as Arch-only.

6.6 tools/preset-signing-passphrase.sh, tools/refreeze.sh

Clean. No material issues.


7. Repository hygiene and configuration

7.1 .gitignore

L-7.1.1 — LOW — .gitignore already covers vim swap files (*.swp/*.swo). OK. ✓

L-7.1.2 — LOW — dist-newstyle/, _site/, _cache/, .env, IGNORE.txt all correctly ignored.

NIT — paper/ is tracked but its purpose is unclear. Not in this audit's scope but worth a README.

7.2 Files in repo root that shouldn't be

  • BeyondComorbidityIndices.docx (3.8 MB) + BeyondComorbidityIndicesSupplement.docx (1.6 MB) — untracked, but 5.4 MB of binary clutter at the project root. Move into paper/ or drafts/.
  • HOMEPAGE.md~ — empty editor backup, gitignored but on disk.
  • HOMEPAGE.md, WRITING.md, migrate_html.md — workspace notes without a home. Consider docs/ or notes/.
  • content/modern_idolatry.md — untracked Markdown file in content/ that isn't content/drafts/. Either move under drafts or commit.
  • IGNORE.txt — exists (empty), gitignored, used by the stability pin mechanism. Clean.

7.3 levineuwirth.cabal

L-7.3.1 — LOW — -Wno-unused-imports masks real unused imports. Set at the executable level. This hid the Metadata.hs no-op and the unused Data.List.intercalate import in several modules. Delete the flag and fix the warnings.

L-7.3.2 — LOW — Version bounds are present but < 4.17 on Hakyll and < 3.7 on Pandoc pin the project to a specific minor release window. Good discipline, but document the refreeze cadence (there's tools/refreeze.sh — reference it in README).

NIT — bytestring < 0.13 is ambitiously loose; the Pandoc ecosystem tends to follow bytestring < 0.12. Verify by running cabal outdated --v2-freeze-file.

7.4 cabal.project

-O1 for the build program is the right call — Hakyll build time is dominated by Pandoc, not the wrapper. ✓

7.5 pyproject.toml

See finding M-5.4.1. Otherwise clean.

7.6 README

M-7.6.1 — MEDIUM — README.md is a single line: # levineuwirth.org. The project has a 63 KB spec.md, multiple build flows, optional features (.venv for embeddings, download-model for semantic search), a signing setup, and an rsync deployment target. None of this is documented. A new contributor (or future-you after a two-year hiatus) cannot get started from what's here.

Minimum viable README:

  1. One-sentence description.
  2. make build, make dev, make deploy entrypoints.
  3. Optional: .venv setup via uv sync for embeddings; make download-model for client-side semantic search.
  4. .env format (link to .env.example).
  5. Pointer to spec.md for architecture.

8. Cross-cutting observations

8.1 Duplicate code

Several helpers are implemented independently in more than one place:

  • HTML escaping: at least five implementations (Utils.hs, Images.hs, Score.hs, Smallcaps.hs, Viz.hs), plus 3× in JS.
  • trim: at least four (Transclusion, EmbedPdf, Wikilinks, plus Contexts.strip).
  • slugify/authorSlugify: two.
  • stringify: two (Compilers.hs, Images.hs), plus Text.Pandoc.Shared.stringify in the library.
  • normaliseUrl: two (Backlinks.hs, SimilarLinks.hs), almost identical but with different index.html handling, so they cannot be naively merged.

Recommendation: create a build/Common.hs (or separate build/Text.hs) for escapeHtml, trim, stringify, and consolidate where possible.

8.2 Partial functions

Partial functions are used in several places with explicit guards (last, head, !!, fromJust). All are safe under their current guards, but the pattern is riskier than case analysis. Sites to re-check: Contexts.hs:263, Citations.hs:142, Stats.hs:125 (median), Stability.hs:75, Catalog.hs:194.

8.3 Error handling consistency

Two different patterns:

  • Score.hs and older filters: readFile blows up, no diagnostic.
  • Viz.hs and Stability.hs: catch + errorBlock / fallback.

Standardize on the second pattern across all IO-performing filters.

8.4 Trust boundary is unstated

The codebase leans on a "the author writes and reviews everything" assumption for:

  • Frontmatter metadata (used raw in HTML by Catalog.hs, Stats.hs, Contexts.hs).
  • Wikilink / transclusion slugs (used raw in HTML by Transclusion.hs).
  • Bibliography entries (used raw in HTML by Citations.hs, Backlinks.hs).

This is a defensible design for a single-author site. Document it in spec.md. If the day ever comes that a PR from a collaborator is accepted, or that a user-provided input feeds any of these fields, the trust boundary needs to be revisited across all of these call sites simultaneously.

8.5 Build reproducibility

  • Python dependencies are upper-bound-free (M-5.4.1).
  • Model download (tools/download-model.sh) is unpinned by SHA (L-6.4.1).
  • KaTeX and Vega are loaded from CDN (templates/partials/head.html:26, 34-36) without SRI hashes.
  • Pandoc version is bounded (>= 3.1 && < 3.7), good — but citeproc behavior varies subtly across these.

Add SRI to CDN assets; pin the ONNX model to a specific revision + SHA; tighten Python pins.
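
An SRI value is the algorithm name, a hyphen, and the base64 of the asset's digest. A minimal sketch of computing one from bytes already downloaded; fetching the asset and pinning the exact CDN version are left out, and the function name is illustrative.

```python
import base64
import hashlib


def sri_hash(content: bytes, algorithm: str = "sha384") -> str:
    """Build an integrity attribute value such as "sha384-..." for an asset."""
    digest = hashlib.new(algorithm, content).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


# Placeholder bytes; in practice, read the exact file served by the CDN.
asset = b"console.log('hello');"
print(f'integrity="{sri_hash(asset)}" crossorigin="anonymous"')
```

The value only stays valid while the CDN URL points at a fixed version, so SRI and version pinning go together.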

8.6 Accessibility posture

Strong foundation (skip link, ARIA on nav, semantic elements, reduce-motion support), with localized gaps:

  • Gallery overlay focus trap (H-3.3.1)
  • Collapsed TOC (H-4.1.3)
  • Dark-mode text-faint contrast (H-4.1.2)
  • Keyboard-only equivalents for sidenote + annotation picker interactions

Addressing H-3.3.1, H-4.1.3, and H-4.1.2 alone would raise the overall a11y grade meaningfully.


P0 — correctness blockers

  1. Filters/Images.hs:110 — lowerExt bug. One-line fix (takeExtension). Restores the entire WebP pipeline.
  2. Commonplace.hs:126-131 — parenthesize the if. One line.
  3. tools/embed.py:68-73 — fix root URL. Three lines.
  4. Authors.hs:50 — add content/essays/*/index.md to allContent.
  5. Filters/Sidenotes.hs:38 — numeric labels (or error on >26).

P1 — silent-failure hardening

  1. Filters/Score.hs:40 — missing file handling.
  2. Filters/Viz.hs:96 — missing file handling.
  3. Filters/Images.hs:77 — dedup passedKvs blacklist.
  4. Filters/Links.hs:59 — proper hostname match.
  5. tools/import-poetry.py:193 — escape newlines in YAML strings.

P2 — accessibility

  1. Dark-mode --text-faint contrast.
  2. Gallery focus trap.
  3. TOC collapsed-state keyboard access.
  4. Global :focus-visible styles.

P3 — hygiene and refactor

  1. Missing CSS variables (--rule, --font-ui, --bg-subtle).
  2. Consolidate duplicate escapeHtml/trim/stringify.
  3. README.md with actual contents.
  4. Delete build/Metadata.hs or populate.
  5. Remove -Wno-unused-imports from levineuwirth.cabal and fix what surfaces.
  6. Relocate .docx binaries out of repo root.

P4 — nice to have

  1. Reproducibility: SRI on CDN, pinned ONNX, tightened Python bounds.
  2. Consolidate Backlinks/Authors/Tags/Site content patterns into a single Patterns.hs.
  3. Defense-in-depth escaping in Transclusion.hs, Catalog.hs.
  4. make deploy guard for VPS_* variables.

Appendix A — files scanned in full

  • Haskell (build system): Main.hs, Site.hs, Contexts.hs, Stats.hs, Backlinks.hs, Compilers.hs, Citations.hs, Stability.hs, Catalog.hs, Commonplace.hs, Authors.hs, Tags.hs, Pagination.hs, SimilarLinks.hs, Utils.hs, Metadata.hs, Filters.hs.
  • Haskell (filters): Filters/Images.hs, Filters/Transclusion.hs, Filters/Score.hs, Filters/Viz.hs, Filters/Sidenotes.hs, Filters/Links.hs, Filters/Wikilinks.hs, Filters/EmbedPdf.hs. Others via parallel audit.
  • JavaScript: 20 files under static/js/ via parallel audit (prism.min.js excluded as vendor).
  • CSS: 22 files under static/css/ via parallel audit.
  • Templates: default.html, partials/head.html, partials/nav.html, plus the full template tree via parallel audit.
  • Python: tools/embed.py, plus tools/import-poetry.py, tools/viz_theme.py via parallel audit.
  • Shell: tools/convert-images.sh, tools/sign-site.sh, tools/download-model.sh, tools/subset-fonts.sh, tools/preset-signing-passphrase.sh, tools/refreeze.sh, Makefile.
  • Config: levineuwirth.cabal, cabal.project, pyproject.toml, .gitignore, .env.example.

Appendix B — what was not audited

  • templates/partials/metadata.html, footer.html, page-footer.html, paginate-nav.html — inspected briefly via the CSS/template sub-audit only.
  • static/css/build.css — cited by the CSS audit for undefined variable usage; rules not fully traced.
  • data/*.bib, data/*.csl — treated as data, not audited for CSL correctness.
  • content/**/*.md — authored content, out of scope.
  • _site/, _cache/, dist-newstyle/, .venv/ — build outputs.
  • spec.md — design document, referenced but not audited line-by-line.
  • prism.min.js, pagefind output, KaTeX, Vega — vendor / third-party.

— End of audit —