Migration Plan: Refactoring Stats.hs HTML Generation

This document outlines a comprehensive migration plan for refactoring build/Stats.hs from manual string concatenation to a type-safe HTML combinator library, specifically blaze-html.

Current Architecture and Issues

Currently, build/Stats.hs generates the HTML for the /build/ and /stats/ telemetry pages by manually concatenating raw strings (e.g., "<div class=\"build-bar-row\">" ++ ...).

This approach has several drawbacks:

  1. Security (XSS): It is trivial to introduce Cross-Site Scripting (XSS) vulnerabilities if dynamic content (like post titles) is not manually escaped before being interpolated into the HTML string. The audit report specifically flagged the link function for this.
  2. Correctness: It is easy to produce malformed HTML (e.g., missing closing tags, improperly nested elements, unescaped attributes) because the compiler cannot verify the structure of the string.
  3. Maintainability: Complex HTML structures (like the 52-week activity heatmap) become difficult to read, modify, and debug when buried within string interpolation logic.
  4. Elegance: Stringly-typed markup bypasses the type-safe abstractions that idiomatic functional code relies on.

Proposed Solution: blaze-html

blaze-html is a fast, mature, type-safe HTML combinator library for Haskell. It allows you to construct HTML documents using native Haskell functions and operators. By ensuring text and attribute values are escaped by default, it substantially reduces XSS risk. Furthermore, it improves structural correctness and reduces malformed markup by constructing HTML through typed combinators instead of ad hoc string concatenation.

Scope: This migration covers build/Stats.hs only. The separate Site.hs JSON-string-concat issue from the audit report is a distinct fix and is not addressed here.

For SVG generation (the heatmap), we will not add blaze-svg as a dependency. It is not currently in cabal.project.freeze and adding it would risk the dependency-resolution instability the audit already flagged. Instead, SVG elements will be emitted via blaze-html's custom-element facility (Text.Blaze.Internal.customParent / customAttribute), or via a small local helper module. This achieves type-safe SVG emission without a new dependency.

1. Dependency Updates

blaze-html 0.9.2.0 is already pinned in cabal.project.freeze as a transitive dependency of Hakyll/Pandoc. The only required change is to declare it explicitly in levineuwirth.cabal.

  • Modify levineuwirth.cabal: Add blaze-html >= 0.9 && < 0.10 to the build-depends section of the site executable.
  • No freeze update required. The package is already resolved; no cabal freeze run is needed.

2. Module Imports

In build/Stats.hs, import the core blaze-html modules:

import qualified Text.Blaze.Html5            as H
import qualified Text.Blaze.Html5.Attributes as A
import           Text.Blaze.Html.Renderer.String (renderHtml)

For SVG custom elements (heatmap), use blaze-html's internal custom-element facility:

import qualified Text.Blaze.Internal as BI

Hakyll's makeItem takes a String, so renderHtml :: Html -> String is the correct renderer. Use it and stop there — the stats page is a few dozen KB at most and performance is not a concern.
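
As a quick illustration of the default escaping and of rendering to String at the boundary, here is a minimal sketch. The helper name `untrustedParagraph` and the sample input are made up for illustration and do not come from Stats.hs.

```haskell
import qualified Text.Blaze.Html5 as H
import           Text.Blaze.Html.Renderer.String (renderHtml)

-- Hypothetical helper: wraps untrusted text in a paragraph.
-- H.toHtml escapes <, >, & and quotes in the text it is given.
untrustedParagraph :: String -> H.Html
untrustedParagraph txt = H.p (H.toHtml txt)

-- Rendering happens once, at the outer boundary.
rendered :: String
rendered = renderHtml (untrustedParagraph "<b>a & b</b>")
-- rendered == "<p>&lt;b&gt;a &amp; b&lt;/b&gt;</p>"
```

The markup characters in the input arrive in the output as entities, without any manual escaping at the call site.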

3. Refactoring Strategy

The refactoring process should be approached incrementally, function by function. Crucially, intermediate functions must return H.Html, with rendering to String occurring only at the absolute outer boundary.

Phase 1: URL Sanitization and Core Helpers

While blaze-html escapes text and attributes, it does not validate URLs. An attacker could still inject javascript:alert(1) into an href attribute. We must introduce URL validation alongside our typed HTML helpers.

  • URL Validation:

    isSafeUrl is defense-in-depth: in current code every URL is produced by Hakyll's getRoute or constructed as a /tag/ string, so there is no live XSS surface. Nevertheless, include it to prevent regressions.

    A naive prefix check on the raw string misses JavaScript: (mixed case), \tjavascript: (leading whitespace), and data:text/html payloads. Use a case-insensitive allowlist, applied after stripping leading whitespace, instead:

    import Data.Char (isSpace, toLower)
    import Data.List (isPrefixOf)
    
    isSafeUrl :: String -> Bool
    isSafeUrl u =
        let norm = map toLower (dropWhile isSpace u)
        in  any (`isPrefixOf` norm) ["/", "https://", "mailto:", "#"]
    
    safeHref :: String -> H.AttributeValue
    safeHref u
      | isSafeUrl u = H.stringValue u
      | otherwise   = H.stringValue "#"
    

    Note: http:// is intentionally excluded (mixed-content over HTTPS).

  • link:

    • New:
      link :: String -> String -> H.Html
      link url title = H.a H.! A.href (safeHref url) $ H.toHtml title
      
  • section:

    • New:
      section :: String -> String -> H.Html -> H.Html
      section id_ title body = do
          H.h2 H.! A.id (H.stringValue id_) $ H.toHtml title
          body
      
  • table and dl: These will utilize monadic do notation or mapM_ over lists to generate rows and cells, returning H.Html natively.

  • Static TOC builders (statsTOC, pageTOC): These also emit string-concat HTML and must be migrated here alongside the other primitives, not left for later.
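
To make the behavior of the URL allowlist from this phase concrete, the sketch below repeats isSafeUrl (so it is self-contained) and pairs a few sample inputs with the verdict the allowlist gives them. The sample URLs and the names `urlCases` and `allCasesHold` are illustrative assumptions, not part of the existing code.

```haskell
import Data.Char (isSpace, toLower)
import Data.List (isPrefixOf)

-- Same allowlist as in the plan, repeated so the sketch stands alone.
isSafeUrl :: String -> Bool
isSafeUrl u =
    let norm = map toLower (dropWhile isSpace u)
    in  any (`isPrefixOf` norm) ["/", "https://", "mailto:", "#"]

-- Hypothetical sample inputs paired with the expected verdict.
urlCases :: [(String, Bool)]
urlCases =
    [ ("/essays/foo.html",        True)
    , ("HTTPS://example.org/",    True)   -- scheme match is case-insensitive
    , ("#top",                    True)
    , ("JavaScript:alert(1)",     False)  -- mixed case does not slip through
    , ("\tjavascript:alert(1)",   False)  -- leading whitespace is stripped first
    , ("data:text/html,<script>", False)
    , ("http://example.org/",     False)  -- excluded on purpose
    , ("//example.org/x",         True)   -- caveat: protocol-relative URLs match "/"
    ]

-- True when every sample gets the expected verdict.
allCasesHold :: Bool
allCasesHold = all (\(u, expected) -> isSafeUrl u == expected) urlCases
```

The last case is worth noting: a protocol-relative URL passes the "/" rule. That is not a script-injection vector, but if off-site links of that form are unwanted, the "/" rule could be tightened to reject a second leading slash.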

Phase 2: Structural Components

Tackle the larger layout functions once the basic primitives are type-safe.

  • renderContent, renderPages, renderDistribution, renderTagsSection, renderLinks, renderEpistemic, renderOutput, renderRepository, renderBuild, renderCorpus, renderNotable, renderMonthlyVolume, renderStatsTags: All of these return String today and must be updated to return H.Html. They will compose the newly typed helper functions (section, table, dl). Example logic for a table row:
    H.tr $ mapM_ (H.td . H.toHtml) cells
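
The table and dl helpers that these functions compose might look roughly like the following. This is a sketch under assumptions: the actual signatures in Stats.hs may differ, and the argument shapes (a list of header cells plus rows of plain-text cells, and a list of term/description pairs) are guesses made for illustration.

```haskell
import qualified Text.Blaze.Html5 as H
import           Text.Blaze.Html.Renderer.String (renderHtml)

-- Assumed signature: header cells, then body rows of plain-text cells.
table :: [String] -> [[String]] -> H.Html
table headers rows = H.table $ do
    H.thead $ H.tr $ mapM_ (H.th . H.toHtml) headers
    H.tbody $ mapM_ row rows
  where
    row cells = H.tr $ mapM_ (H.td . H.toHtml) cells

-- Assumed signature: (term, description) pairs.
dl :: [(String, String)] -> H.Html
dl pairs = H.dl $ mapM_ entry pairs
  where
    entry (term, desc) = do
        H.dt (H.toHtml term)
        H.dd (H.toHtml desc)

-- Rendered only here, for demonstration; real callers keep passing H.Html around.
sampleTable :: String
sampleTable =
    renderHtml (table ["Kind", "Count"] [["Essays", "12"], ["Notes <draft>", "3"]])
```

Because both helpers return H.Html, a render function can sequence them in a do block next to section without any intermediate String.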
    

Phase 2.5: Lift the Heatmap's Inline <style>

The current heatmap (renderHeatmap) ships a <style> block embedded inside the SVG (Stats.hs:207-211). Migrate those rules to static/css/, where the rest of the heatmap CSS variables (--hm-0 through --hm-4) live. This is the right moment to do it: don't carry the inline style into the typed version.

Phase 3: The Heatmap (renderHeatmap)

The heatmap generation involves nested SVG elements, CSS classes, and <title> tooltips.

  • Separation of Concerns: Separate the data calculation from the rendering. Keep date, color, and layout calculations in pure data functions, and have the rendering functions handle strictly the HTML/SVG emission.
  • SVG via custom elements: Use blaze-html's Text.Blaze.Internal.customParent and customAttribute to construct SVG elements type-safely, replacing "<rect class=\"" ++ ... with typed combinators; no blaze-svg dependency is required. Alternatively, define a minimal local Svg helper module (10-15 lines) that wraps the most-used SVG tags (svg, rect, text_, figure) before this phase begins.
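
A local SVG helper built on the custom-element facility could be sketched as follows. All names here (`svg`, `rect`, `svgTitle`, `attr`, `cell`) are hypothetical, and so are the `hm-N` class naming and the meaning of `level` as a 0 to 4 intensity bucket; the real heatmap may organize its cells differently.

```haskell
import qualified Text.Blaze.Html5    as H
import qualified Text.Blaze.Internal as BI
import           Text.Blaze.Html.Renderer.String (renderHtml)

-- Hypothetical wrappers around blaze-html's custom-element facility.
svg :: H.Html -> H.Html
svg = BI.customParent (BI.stringTag "svg")

-- rect is a parent here so that it can carry a <title> tooltip child.
rect :: H.Html -> H.Html
rect = BI.customParent (BI.stringTag "rect")

svgTitle :: H.Html -> H.Html
svgTitle = BI.customParent (BI.stringTag "title")

-- Attribute with an arbitrary name; the value is escaped on rendering.
attr :: String -> String -> H.Attribute
attr name value = BI.customAttribute (BI.stringTag name) (H.stringValue value)

-- One heatmap cell. The coordinates and level would come from the pure
-- layout functions; this function only emits markup.
cell :: Int -> Int -> Int -> String -> H.Html
cell x y level tooltip =
    rect H.! attr "x" (show x)
         H.! attr "y" (show y)
         H.! attr "class" ("hm-" ++ show level)
         $ svgTitle (H.toHtml tooltip)

sampleCell :: String
sampleCell = renderHtml (cell 0 13 2 "3 posts & 1 note")
-- sampleCell == "<rect x=\"0\" y=\"13\" class=\"hm-2\"><title>3 posts &amp; 1 note</title></rect>"
```

Keeping the wrappers this thin means the data side of the heatmap never touches markup, and the tooltip text is escaped like any other content.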

Phase 4: Integration with Hakyll

Finally, update the top-level Hakyll rules that consume these generated structures. This is the only place renderHtml should be called.

  • statsRules:
    • The content variable will now be a single, large H.Html value.
    • Call renderHtml exactly once on that value to produce a String, then pass it to makeItem. The stripHtmlTags-based word-count pipeline operates on that rendered string and is unaffected.
    • The static TOC strings (pageTOC, statsTOC) are also rendered via renderHtml before being passed to constField.
    • Example:
      let htmlContent = do
              renderContent rows
              renderPages allPIs oldestDate newestDate
              -- ...
          contentString = renderHtml htmlContent
          plainText     = stripHtmlTags contentString
      

Phase 5: Testing and Auditing

  • Auditing: During migration, thoroughly search for and eliminate any remaining raw HTML helpers, pre-escaped content, or unsafe rendering patterns.
  • Testing: Add specific tests for escaping behavior to ensure security goals are met:
    • Title containing <script>alert(1)</script> renders escaped.
    • Attributes with quotes are escaped correctly.
    • Dangerous URLs (e.g., javascript:...) are rejected or rewritten by isSafeUrl/safeHref.
    • Golden/snapshot tests to ensure generated HTML still contains the expected structure.
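
The escaping checks listed above could start out as a framework-free sketch like the one below. It assumes the link and safeHref definitions from Phase 1 (repeated so the block is self-contained); the names `checks` and `failures` and the sample inputs are illustrative, and a real suite would likely live in whatever test framework the project adopts.

```haskell
import           Data.Char (isSpace, toLower)
import           Data.List (isInfixOf, isPrefixOf)
import qualified Text.Blaze.Html5            as H
import qualified Text.Blaze.Html5.Attributes as A
import           Text.Blaze.Html.Renderer.String (renderHtml)

-- Phase 1 helpers, repeated here so the sketch stands alone.
isSafeUrl :: String -> Bool
isSafeUrl u =
    let norm = map toLower (dropWhile isSpace u)
    in  any (`isPrefixOf` norm) ["/", "https://", "mailto:", "#"]

safeHref :: String -> H.AttributeValue
safeHref u
  | isSafeUrl u = H.stringValue u
  | otherwise   = H.stringValue "#"

link :: String -> String -> H.Html
link url title = H.a H.! A.href (safeHref url) $ H.toHtml title

-- Each check pairs a description with a Boolean result.
checks :: [(String, Bool)]
checks =
    [ ( "script tag in title is escaped"
      , let out = renderHtml (link "/p" "<script>alert(1)</script>")
        in  "&lt;script&gt;" `isInfixOf` out && not ("<script>" `isInfixOf` out) )
    , ( "quote in attribute value is escaped"
      , renderHtml (link "/a\"b" "t") == "<a href=\"/a&quot;b\">t</a>" )
    , ( "javascript: URL is rewritten to #"
      , renderHtml (link "javascript:alert(1)" "x") == "<a href=\"#\">x</a>" )
    ]

-- Descriptions of the checks that did not hold; empty means all passed.
failures :: [String]
failures = [name | (name, ok) <- checks, not ok]
```

A golden test for the full page can then be layered on top by comparing the rendered String against a stored snapshot.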

Summary of Benefits

Completing this migration will:

  • Substantially reduce XSS risk: Text and attribute values will be escaped by default, and dangerous URLs will be validated and neutralized.
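  • Improve structural correctness: Typed combinators guarantee balanced tags and properly quoted attributes, which rules out the most common forms of malformed markup. (blaze-html does not check the HTML content model, so element nesting still needs review.)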
  • Improve composability: Returning H.Html from all helper functions avoids "half-rendered" strings and double-escaping issues.
  • Improve readability and testability: Complex UI components like SVG heatmaps will be declarative, and pure data processing will be decoupled from rendering.