From 2469c34676bf16caed7a41d920b88e8d56bb48a3 Mon Sep 17 00:00:00 2001 From: Levi Neuwirth Date: Sat, 18 Apr 2026 17:00:36 -0400 Subject: [PATCH] add TECHNICAL.md engineering roadmap MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Phase 0 (C-enabling pipeline work) → Phase 1 (Paper C clinical validation) → Phase 2 (open-source release + Paper A), with Track 2 (clinical platform) as a contingent side track. Mirrors RESEARCH.md but for engineering scope rather than methodology. --- TECHNICAL.md | 1191 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1191 insertions(+) create mode 100644 TECHNICAL.md diff --git a/TECHNICAL.md b/TECHNICAL.md new file mode 100644 index 0000000..b48238b --- /dev/null +++ b/TECHNICAL.md @@ -0,0 +1,1191 @@ +# NeuroPose Technical Ideation Notes + +A living engineering roadmap, parallel to `RESEARCH.md`. Where +`RESEARCH.md` captures open methodological questions (DTW, skeleton +choice, hosting the model), this document captures open *engineering* +questions — release readiness, operability, scaling — and the paths +they could take. + +This is **not** user-facing documentation. Items here are *candidates* +for future work, and inclusion does not imply commitment. + +## How to use this document + +- Add a section when you start thinking about a new area of technical + investment. +- Each section should end with a **Scope**, **Sketch**, or **Open + questions** block so it's obvious to a future you (or a new + contributor) what the concrete next move would be. +- When an item in here is decided and implemented, move it to the + relevant place in `docs/` or in the code itself, and leave a short + pointer behind (*See `docs/deployment.md` for the resolved design.*). +- The audience is anyone maintaining the codebase — Levi, David, + Praneeth, Dr. Shu, and whoever comes after us. Assume competence in + Python and systems work; don't assume familiarity with our specific + tooling choices. + +## Three phases, then a contingent track + +There are four distinct technical objectives, ordered by timeline and +by what each enables next. The sequencing is deliberate: each phase +unblocks the next, and doing them in any other order either publishes +Paper C on top of a pipeline its own design notes disavow, or delays +the open-source release past the window where the accompanying paper +is still salient. + +1. **Phase 0 — C-enabling pipeline work.** A targeted subset of + engineering work that has to land *before* Paper C can start. The + DTW defaults shipped in 0.1 are explicitly a "mechanical port, not + a methodological choice" (see `RESEARCH.md` §1); running the + clinical validation study on them would mean publishing results + from a pipeline the accompanying design notes explicitly criticize. + Phase 0 fixes the analyzer's methodological foundations (Procrustes + preprocessing, cycle segmentation, joint-angle DTW representation), + locks in the reproducibility surface (`Provenance` subobject, + YAML-configurable analysis pipeline), and sets up schema migration + so data generated during Phase 1 survives the long write-up. + **Near-term, well-scoped, weeks of work.** + +2. **Phase 1 — Paper C: clinical validation study.** The planned + clinical-methods paper: cycle-aware joint-angle DTW for clinical + gait similarity, validated against MoCap ground truth and/or + clinician ratings. Gated on MoCap data access via Dr. Shu. 
This is + research work, not engineering work — this document describes the + engineering scaffolding *around* it, not the paper itself. Phase 2 + work can happen in the background during this phase as ideal filler + for research-burnout cycles. **Months; timeline driven by data + access and experimental design.** + +3. **Phase 2 — Coordinated open-source release + Paper A.** The + engineering-paper companion (A) describing the tech stack, plus + the tagged 0.1 release: PyPI publication, docs deployment, Docker + images, CI matrix, supervision artifacts, doctor preflight, all + the operational items that make the tool credible to external + users. Timed to arrive *with or slightly before* Paper C's + submission, producing a paper-plus-tool bundle that reviewers can + actually run. **Weeks of work, timing driven by Paper C's + submission window.** + +4. **Track 2 — Clinical platform (contingent).** Everything beyond + the open-source research tool — multi-tenancy, audit logging, + HTTP/API layer, clinician UI, clinical-system integrations. Not + sequenced; activates only if specific triggers fire (external + demand, multi-site ambition, funding mandate, publication + traction). Most of this is background thinking, not planned work. + The value of keeping it in this document is so that Phase 0 and + Phase 2 decisions don't accidentally foreclose Track 2 options. + +Phases 0 → 1 → 2 form a near-term sequence that culminates in a +paper-plus-release bundle. Track 2 sits outside that sequence and +does not gate any of it. + +## Contents + +- [Phase 0: C-enabling pipeline work](#phase-0-c-enabling-pipeline-work) + - [Procrustes preprocessing](#procrustes-preprocessing) + - [Gait cycle segmentation](#gait-cycle-segmentation) + - [Joint-angle DTW representation](#joint-angle-dtw-representation) + - [Provenance subobject](#provenance-subobject) + - [YAML-configurable analysis pipeline](#yaml-configurable-analysis-pipeline) + - [Schema migration for VideoPredictions](#schema-migration-for-videopredictions) +- [Phase 1: Clinical validation study (Paper C)](#phase-1-clinical-validation-study-paper-c) +- [Phase 2: Coordinated open-source release + Paper A](#phase-2-coordinated-open-source-release--paper-a) + - [Release definition](#release-definition) + - [Apple Silicon CI matrix](#apple-silicon-ci-matrix) + - [Mac hardware validation pass](#mac-hardware-validation-pass) + - [Retention and pruning](#retention-and-pruning) + - [neuropose doctor preflight](#neuropose-doctor-preflight) + - [Process supervision artifacts](#process-supervision-artifacts) + - [Structured logging option](#structured-logging-option) + - [Monitor authentication](#monitor-authentication) + - [Docker GPU image](#docker-gpu-image) + - [Dependency freshness automation](#dependency-freshness-automation) + - [Release workflow](#release-workflow) + - [Error-path test coverage expansion](#error-path-test-coverage-expansion) +- [Track 2: Clinical platform (contingent)](#track-2-clinical-platform-contingent) + - [Triggers to activate Track 2](#triggers-to-activate-track-2) + - [Multi-tenancy and identity](#multi-tenancy-and-identity) + - [Audit logging and compliance posture](#audit-logging-and-compliance-posture) + - [HTTP/API layer](#httpapi-layer) + - [Clinician-facing UI](#clinician-facing-ui) + - [Horizontal scaling](#horizontal-scaling) + - [Backup, replication, and data durability](#backup-replication-and-data-durability) + - [Clinical-system integrations](#clinical-system-integrations) + - [Deterministic inference 
mode](#deterministic-inference-mode) + - [Observability and SLOs](#observability-and-slos) + - [Supply-chain attestation and signed releases](#supply-chain-attestation-and-signed-releases) + - [Deployment orchestration](#deployment-orchestration) +- [Decisions to not prematurely foreclose](#decisions-to-not-prematurely-foreclose) + +--- + +## Phase 0: C-enabling pipeline work + +The six items below are prerequisites for Paper C. Until they are +landed, every analysis C would produce would be running on defaults +that `RESEARCH.md` §1 explicitly flags as provisional. Ship these +first, in any order that suits the implementer's cadence, and the +rest of the project can pick up with confidence that Phase 1 results +are trustworthy. + +### Procrustes preprocessing + +**Status:** Not implemented. `neuropose.analyzer.features` ships +`extract_joint_angles` and feature-statistics helpers; no alignment +step exists between pose sequences. + +**Why it matters for Paper C:** without alignment, DTW distance is +translation- and orientation-dependent. Two recordings of the same +subject from different camera positions produce different distances, +which is almost never what a clinician wants. Paper C's methods +section would need to apologize for this in print; cheaper to fix the +method than to defend it. + +**Scope:** + +- Add `procrustes_align(a: np.ndarray, b: np.ndarray, *, mode: + Literal["per_frame", "per_sequence"]) -> tuple[np.ndarray, + np.ndarray, AlignmentDiagnostics]` to `neuropose.analyzer.features`. + Implements the Kabsch algorithm (closed-form optimal rigid + transform). Per-frame aligns each frame of A to the corresponding + frame of B independently; per-sequence computes one transform over + the whole sequence. Both are useful — per-frame for fine-grained + matching, per-sequence for preserving within-trial dynamics. +- Return aligned arrays plus an `AlignmentDiagnostics` dataclass with + the fitted rotation magnitude and translation magnitude so + downstream code can flag suspiciously large transforms (usually a + sign of upstream annotation error). +- Expose as an opt-in `align: Literal["none", "procrustes_per_frame", + "procrustes_per_sequence"] = "none"` parameter on every DTW entry + point in `neuropose.analyzer.dtw`. Default `none` preserves current + behavior; Paper C's pipeline sets it to `procrustes_per_sequence`. +- Unit tests: construct a known rotation + translation between two + synthetic skeletons, verify alignment recovers it to within + floating-point precision; verify alignment of a sequence with its + own translated copy produces zero residual. + +**Non-scope:** + +- Non-rigid alignment (thin-plate splines, learned registration). Not + needed for skeleton-level comparison and would be a research + contribution on its own. + +**Open question:** should alignment also include optional scaling +(scaled-Procrustes / full Procrustes)? For cross-subject comparison +it almost certainly should. Default to scale-preserving and add a +`scale: bool = False` flag; Paper C can flip it on for cross-subject +figures. + +### Gait cycle segmentation + +**Status:** `segment_by_peaks` in `neuropose.analyzer.segment` +performs generic valley-to-valley segmentation on a supplied 1D +signal. There is no gait-specific wrapper that knows to look at the +heel's vertical coordinate. + +**Why it matters for Paper C:** clinical gait analysis wants to +compare *the 4th heel-strike of trial A* to *the 4th heel-strike of +trial B*, not *frame 120 of A vs frame 120 of B*. 
Per-cycle DTW is +the standard approach in the biomechanics literature (Sadeghi et al. +2000 and descendants); running full-trial DTW on gait is a choice +reviewers of Paper C would correctly flag as methodologically weak. + +**Scope:** + +- New `segment_gait_cycles(predictions: VideoPredictions, *, joint: + str = "rhee", axis: Literal["x", "y", "z"] = "y", min_cycle_seconds: + float = 0.4) -> Segmentation` in `neuropose.analyzer.segment`. +- Under the hood: extract the specified joint's coordinate along the + specified axis, apply `segment_by_peaks` with appropriate distance + and prominence thresholds (derived from `min_cycle_seconds` via + `predictions.metadata.fps`), return the resulting `Segmentation` + (the existing `neuropose.io.Segmentation` type) so downstream + tooling picks it up unchanged. +- Two-sided detection: run the same detection on the opposite heel + and return *both* per-side segmentations under named keys + (`left_heel_strikes`, `right_heel_strikes`). Clinical users will + want both. +- Allow the reference joint and axis to be configurable so trials + recorded with a different camera orientation (lateral vs frontal + vs oblique) can still be segmented without a code change. + +**Non-scope:** + +- HMM-based cycle detection, learned cycle detectors. Peak detection + on vertical coordinate is standard, well-understood, and the + method the biomechanics literature expects to see. +- Handling pathological gaits where heel-strikes are absent + (shuffling, walker-assisted). The function should degrade + gracefully (return a `Segmentation` with an empty list, not raise), + and Paper C's data-quality filtering handles the rest. + +**Open question:** should the function also emit a "confidence" per +cycle (prominence of the detected peak, regularity of spacing) that +Paper C can use to filter out low-quality detections? Cheap to add, +useful downstream. Recommend yes. + +### Joint-angle DTW representation + +**Status:** `dtw_all`, `dtw_per_joint`, and `dtw_relation` operate on +raw 3D coordinates or joint-pair displacements. `extract_joint_angles` +produces per-frame angle sequences but is not wired as a DTW input. + +**Why it matters for Paper C:** angle-space DTW is translation- and +rotation-invariant by construction, scale-invariant if normalized, +and directly interpretable in clinical terms ("knee flexion angle +during swing phase"). Paper C's headline figures almost certainly +use angle-space distances; raw coordinates would draw the obvious +reviewer question of why we aren't comparing the thing clinicians +actually measure. + +**Scope:** + +- Add `representation: Literal["coords", "angles", "relation"] = + "coords"` to every DTW entry point. The `coords` default preserves + existing behavior; `angles` runs `extract_joint_angles` on each + input first; `relation` is the existing `dtw_relation` path + expressed as a representation choice rather than a separate + function (leaving the `dtw_relation` name as a convenience wrapper + if preferred). +- Degenerate-vector handling: `extract_joint_angles` returns NaN for + degenerate (zero-length) vectors. The DTW path needs to decide how + to handle NaN — skip-and-interpolate, drop, or propagate to the + distance. Propagation is safest (makes the problem visible); + interpolation is what clinical users probably want day-to-day. + Default to propagation and expose `nan_policy: Literal["propagate", + "interpolate", "drop"]` for experimentation. 
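+
+  A minimal sketch of what the three policies could mean on a
+  `(frames, joints)` angle array (helper name hypothetical;
+  `interpolate` shown as plain linear interpolation over NaN runs):
+
+  ```python
+  import numpy as np
+
+  def apply_nan_policy(angles: np.ndarray, policy: str) -> np.ndarray:
+      """Hypothetical pre-DTW NaN handling for a (frames, joints) array."""
+      if policy == "propagate":
+          return angles  # NaNs reach the DTW distance and stay visible
+      if policy == "drop":
+          return angles[~np.isnan(angles).any(axis=1)]  # drop NaN frames
+      if policy == "interpolate":
+          out = angles.copy()
+          idx = np.arange(out.shape[0])
+          for j in range(out.shape[1]):
+              col, nan = out[:, j], np.isnan(out[:, j])
+              if nan.any() and not nan.all():
+                  col[nan] = np.interp(idx[nan], idx[~nan], col[~nan])
+          return out
+      raise ValueError(f"unknown nan_policy: {policy}")
+  ```
+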
+- Tests: synthetic pair with known angular difference, assert DTW in + angle-space recovers it independent of global rotation applied to + the input. + +**Non-scope:** + +- Quaternion or SO(3) rotation-space DTW. Interesting but requires a + rotation parameterization the current skeleton output does not + produce. +- Mixed-representation (position + angle concatenated, learned + embeddings). These are experiments Paper C might run; they don't + belong in Phase 0 infrastructure. + +### Provenance subobject + +**Status:** `PerformanceMetrics` captures `tensorflow_version`, +`active_device`, and `tensorflow_metal_active`. Model SHA is not +computed or propagated. `numpy_version` and `neuropose_version` are +not recorded. No first-class `Provenance` object. + +**Why it matters for Paper C:** reproducibility is the first +question a reviewer asks of a clinical-methods paper. The answer +needs to be "same model artifact, same pipeline config, same +versions, same seeds" — and all four need to be recorded on every +`results.json` that underlies a paper figure. Not having this means +either manually tracking it in a lab notebook (fragile, won't +survive personnel turnover) or running every experiment through a +pinned Docker image (expensive, doesn't capture runtime +non-determinism). The subobject is the cheap right answer. + +**Scope:** + +- New `Provenance` pydantic model in `neuropose.io` with fields: + `model_sha256: str`, `model_filename: str`, `tensorflow_version: + str`, `tensorflow_metal_version: str | None`, + `numpy_version: str`, `neuropose_version: str`, `python_version: + str`, `seed: int | None`, `deterministic: bool`, `analysis_config: + dict | None` (the YAML of the run if the pipeline was invoked via + `neuropose analyze --config`). +- Optional `provenance: Provenance | None = None` field on + `VideoPredictions`, `JobResults`, and `BenchmarkResult`. None-valued + on legacy files (enabled by schema migration — see below), populated + on every new write. +- `_model.py` hashes the downloaded tarball on first load (after the + existing SHA verification — the two checks use the same hash so + compute is amortized) and exposes the hash via a + `get_model_sha256()` method on the `Estimator`. `Interfacer._run_job_inner` + constructs the `Provenance` and attaches it to the output. +- Unit test: serialize → JSON → deserialize round-trip identity; + assert `model_sha256` matches the SHA recorded in + `neuropose._model`. + +**Non-scope:** + +- Cryptographic signatures on results.json. That's Phase 2 (sigstore + on release artifacts) or Track 2 (per-output signing) territory, + not Phase 0. +- Provenance on arbitrary intermediate products (numpy arrays, DTW + distance matrices). Top-level JSONs cover Paper C's needs; richer + intermediates can inherit from a hand-off if needed. + +**Open question:** does Paper C need *per-frame* provenance (which +frame was processed with which configuration) or just per-job +provenance? Per-job is enough for reproducibility; per-frame is only +useful if we want to mix configurations within a single job, which +has no current use case. + +### YAML-configurable analysis pipeline + +**Status:** `neuropose.cli`'s `analyze` subcommand is a stub that +raises `NotImplementedError`. Analysis operations are called +individually from Python, or via CLI flags on `segment` and +`benchmark`. No unified representation of "a complete analysis run." 
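+
+For orientation, a sketch of what one such run could look like once
+`analyze` lands. Every field name here is provisional (the concrete
+shape is the `AnalysisConfig` model scoped below); only the `align`,
+`representation`, `nan_policy`, and `joint` values are drawn from the
+Phase 0 items above:
+
+```yaml
+# Hypothetical experiment config; field names are provisional.
+input:
+  predictions: out/trial_007/results.json
+preprocess:
+  align: procrustes_per_sequence
+  segment:
+    method: gait_cycles
+    joint: rhee
+analysis:
+  representation: angles
+  nan_policy: propagate
+output:
+  distance_matrix: out/analysis/trial_007_dtw.json
+```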
+
+**Why it matters for Paper C:** the paper will run many experimental
+configurations — alignment on/off, per-frame vs per-sequence, raw
+coordinates vs joint angles, full-trial vs cycle-segmented DTW,
+various distance metrics. Each experiment should be reproducible
+from a single file that can be version-controlled, diffed, attached
+to the `Provenance` object, and cited in the paper. A Python script
+full of kwargs is the alternative, and it's exactly the alternative
+the open-source community collectively decided against ten years ago.
+
+This item also resolves the "`neuropose analyze`: ship or remove"
+question that was previously open: we are shipping `analyze`, just
+specifically in a YAML-driven form. The stub that currently exists
+becomes the real command in Phase 0.
+
+**Scope:**
+
+- `AnalysisConfig` pydantic model in `neuropose.analyzer` capturing
+  the full pipeline: input source (predictions file path),
+  preprocessing (`align`, `normalize`, `segment`), per-segment
+  analysis (DTW backend, representation, distance function, extra
+  kwargs), output (figures, statistics, distance matrices).
+- Parseable from YAML via pydantic; validated on parse so typos in
+  field names fail early with a clear error.
+- `neuropose analyze --config experiment.yaml [--output
+  results_042.json]` runs the pipeline end-to-end. The config YAML
+  is serialized into the resulting `Provenance.analysis_config`, so
+  the output file is self-describing.
+- Ship three or four *example* configs under `examples/analysis/`
+  that exercise the full matrix of alignment × representation ×
+  segmentation choices Paper C will care about. Double as integration
+  tests.
+
+**Non-scope:**
+
+- A DAG / workflow engine (Snakemake, Nextflow). A flat config is
+  enough for Paper C's needs; reach for a DAG tool only when
+  experiments have genuine inter-stage dependencies, which analysis
+  of a single video does not.
+- Parallel sweep execution. Run multiple configs via a shell loop
+  for now (`for cfg in examples/analysis/*.yaml; do neuropose
+  analyze --config "$cfg" --output "out/$(basename "$cfg" .yaml).json"; done`).
+  A real sweep orchestrator is Track 2.
+
+**Open question:** should there be a `neuropose analyze compare
+<config-a> <config-b>` subcommand that runs both and
+emits a diff figure? Useful for Paper C but not a gating feature —
+post-Phase-0 addition if the need is clear.
+
+### Schema migration for VideoPredictions
+
+**Status:** `VideoPredictions` gained `segmentations: dict[str,
+Segmentation] = Field(default_factory=dict)` during recent work. Old
+JSON files without the field still load (pydantic default-factories
+back-fill), but this is accidental rather than designed-in.
+
+**Why it matters for Paper C:** Paper C will produce analysis results
+over the course of 6-12 months. During that window, Phase 0 work
+itself will evolve — the `Provenance` object will gain fields, the
+`AnalysisConfig` shape will stabilize, maybe the `Segmentation` schema
+will extend. Without migration support, every schema change would
+invalidate some portion of Paper C's already-generated data, forcing
+either a freeze (drops velocity) or a full re-run (wastes compute).
+Migration now is the cheap fix.
+
+**Scope:**
+
+- Add a `schema_version: int = 1` field to `VideoPredictions`,
+  `JobResults`, and `BenchmarkResult` (the three load-anywhere
+  top-level schemas).
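+
+  As a shape sketch of the dispatch this enables (the function itself
+  is specified by the next bullet; the v1-to-v2 step is an invented
+  example of an additive change):
+
+  ```python
+  CURRENT_SCHEMA_VERSION = 2  # hypothetical: assumes one additive bump
+
+  def migrate_video_predictions(payload: dict) -> dict:
+      version = payload.get("schema_version", 1)  # missing => legacy v1
+      if version == 1:
+          payload.setdefault("segmentations", {})  # back-fill the v1 gap
+          version = 2
+      if version != CURRENT_SCHEMA_VERSION:
+          raise ValueError(f"unsupported schema_version: {version}")
+      payload["schema_version"] = CURRENT_SCHEMA_VERSION
+      return payload
+  ```
+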
+- Write `migrate_video_predictions(payload: dict) -> dict` that + takes a raw JSON-loaded dict, dispatches on `schema_version`, and + returns a dict conformant with the current version. Default to 1 + when missing (existing files). +- Wire it into `load_video_predictions()` so the migration runs + before pydantic validation. Log at INFO on migration so users see + when files are being upgraded. +- When writing, always write the current version. + +**Non-scope:** + +- A general-purpose migration framework. A function that dispatches + on an integer is sufficient until we have three versions. +- In-place migration (writing back the upgraded file). Migrations + should run on read; write-back is a separate operator decision. + +--- + +## Phase 1: Clinical validation study (Paper C) + +Phase 1 is *Paper C itself* — the clinical-methods paper this project +exists to produce. The content belongs in the paper, in `RESEARCH.md`, +and in the analysis-config YAMLs under `examples/`, not here. This +section exists only to demarcate the phase and to capture the +engineering commitments that should (and should not) happen during it. + +**Engineering posture during Phase 1:** + +- **Phase 0 is frozen on entry.** Don't refactor the analyzer during + Phase 1; refactors invalidate earlier experiments. If a Phase 0 + shortcoming surfaces during paper-writing, log it in `RESEARCH.md` + and revisit after submission. +- **Phase 2 work is welcome as background.** Writing a launchd plist, + wiring up Dependabot, tightening error-path tests — all of this is + ideal filler work during the experimental-design and writing + phases of Paper C. It consumes different energy than research work + does, and the tool is in better shape on submission day as a + result. +- **`RESEARCH.md` gets the bulk of the updates.** Methods decisions, + reading-list expansions, reviewer-response notes all live there, + not here. +- **Do add engineering-side notes here** when a Paper C experiment + reveals a piece of missing tooling that's worth a Phase 2 item + (for example: "we needed batch-analysis across 200 trials and hit + this, so Phase 2 should include ..."). Phase 1 is the best + possible source of prioritization signal for what Phase 2 is + actually worth. + +**Prerequisite outside this document:** a MoCap-data-access +conversation with Dr. Shu. Nothing in Phase 1 can start until that +conversation has resolved. `RESEARCH.md` §3 flags this as the +gating question for fine-tuning; it is equally the gating question +for validation. + +--- + +## Phase 2: Coordinated open-source release + Paper A + +Phase 2 is the release. Its content is exactly the items listed here +— the engineering work to take the Phase-0-plus-Phase-1 codebase to a +state where an outside researcher can pick it up, install it, run it, +verify its claims, and cite it. It runs concurrently with the tail +end of Phase 1 (see posture notes above) and culminates in a +coordinated drop: tag → PyPI → Pages → arXiv / JOSS submission for +Paper A → reference in Paper C's Code Availability section. + +### Release definition + +Before enumerating the remaining work, define what "released" means. +A release candidate should satisfy all of the following: + +1. **Installable on a blank machine.** `pip install neuropose` or + `uv pip install neuropose` works on both Linux x86_64 and Apple + Silicon Mac, with no manual steps beyond Python 3.11. +2. 
**Runnable without the author in the room.** The `docs/` site is + published somewhere persistent (GitHub Pages, Cloudflare Pages), + the getting-started walkthrough actually works end-to-end, and + the MeTRAbs model downloads and verifies on first run. +3. **Verifiable by a reviewer.** CI runs on every push, covers both + Linux and macOS, and a PR from a stranger could be meaningfully + reviewed without access to the research Mac. +4. **Honest about its limits.** Every surface the release advertises + is either exercised in CI or clearly marked experimental. No + false promises in the README or CLI help text. (The `analyze` + stub that motivated this item pre-Phase-0 is now real per Phase + 0's YAML pipeline, so "ship or remove" is no longer open.) +5. **Versioned.** A git tag exists, `__version__` matches, and + `CHANGELOG.md` has a real release section, not just `[Unreleased]`. +6. **Bundled.** Paper A (tech-stack writeup) and Paper C (clinical + validation) cite the release tag, and the release notes cite + them. The three artifacts arrive together; reviewers of either + paper can find and run the code. + +Items below are the gaps between the end-of-Phase-0 state and that +definition. + +### Apple Silicon CI matrix + +**Status:** `RESEARCH.md` lists this as an open next step; no +`macos-14` entry in `.github/workflows/ci.yml`. + +**Why it matters for release:** every claim of "Apple Silicon +support" is currently "by construction" — the TF 2.16+ floor ships +`darwin/arm64` wheels, the MeTRAbs SavedModel has zero custom ops, and +therefore it should work. It has not been empirically confirmed on +real hardware in an automated way. For a public release, we either +verify in CI or we stop claiming Mac support in the README. + +**Scope:** + +- Add a `macos-14` matrix entry to the `test` job (lint and typecheck + stay single-platform, they're platform-independent). +- Exclude `slow` markers on macOS so we don't pay the 2 GB model + download twice per run. +- Accept that the first green macOS run may require two or three + hotfixes — path case sensitivity, `multiprocessing` spawn vs fork, + shared library load order — and budget a day for that. +- Do **not** add a Metal runner. GitHub's `macos-14` runners don't + expose the GPU to TensorFlow in a useful way, and the `[metal]` + extra's numerical verification is a separate task that needs real + M-series silicon we control. + +**Sketch:** + +```yaml +test: + strategy: + fail-fast: false + matrix: + os: [ubuntu-latest, macos-14] + runs-on: ${{ matrix.os }} +``` + +Everything else in the job stays the same; `uv` works identically on +both platforms. + +### Mac hardware validation pass + +**Status:** Unexercised. The Shu Lab research Mac (`100.64.15.110`) is +available; we have an rsync script but no cron job, no automated +smoke check, no numerical-divergence report against the Linux +baseline. + +**Why it matters for release:** CI on GitHub's `macos-14` runners +validates that the wheels install and the tests pass on Apple +Silicon. It does not validate that the real MeTRAbs model loads, that +inference runs, or that `poses3d` on the Mac matches `poses3d` on +Linux within a sane tolerance. Those are different questions, and +answering them against a throwaway runner each time would be wasteful +and unreliable. + +A minimum version of this check — "does `detect_poses` produce +output on the research Mac at all?" 
— should happen during Phase 0 +regardless, because Paper C will likely run on the same hardware and +a silent numerical divergence there would invalidate the paper's +results. The scope below is the full, release-grade version. + +**Scope:** + +- Run `neuropose benchmark --compare-cpu` against a reference clip on + the research Mac. Capture the resulting `BenchmarkResult` JSON. +- Commit the JSON as `benchmarks/reference/mac_m3_ultra_cpu_v0_1.json` + (a tracked file, not gitignored — this is the reference numerics + we'll compare against going forward). +- Separately, run the `[metal]` path and diff. Record in + `RESEARCH.md` whether divergence is within the ~1e-2 mm budget the + research notes propose, or whether the Metal path is in the "use at + your own risk" column. +- Document the findings as a new section in `RESEARCH.md` ("Apple + Silicon verification, 2026-0X") and close the corresponding + open-question entry. + +**Open question:** should the reference JSON become a test input +(slow-marked integration test that re-runs benchmark on a developer's +machine and asserts divergence from the committed reference), or just +documentation? The former catches regressions automatically at the +cost of a 2 GB model download in the slow job; the latter is cheaper +but easier to ignore. + +### Retention and pruning + +**Status:** `out/` and `failed/` grow forever. No retention config. +No `neuropose prune` command. + +**Why it matters for release:** a research Mac running the daemon +unattended for months will fill its disk. The first support request +will be "the daemon just stopped working" and the answer will be "you +ran out of disk." We can solve this once now, or a hundred times +later. + +**Scope:** + +- Add a `retention_days: int | None = None` setting (default None = + disabled, preserving current behavior). +- When set, the daemon checks on each poll whether any job in + `out/` or `failed/` is older than the threshold and removes it. The + corresponding `status.json` entry transitions to a new `PRUNED` + state (keeping the audit trail) or is removed entirely (keeping the + status file small) — pick one and document. +- Ship a `neuropose prune [--older-than N] [--dry-run]` one-shot + command for operators who want manual control. +- Document in `docs/deployment.md` with a recommended default (30 + days feels right for benchmark/iteration workflows; clinical + deployments would be legal-driven and much longer). + +**Open question:** should pruned jobs' `status.json` entries be +preserved as tombstones (so a user asking "where did job X go?" can +see "pruned 2026-05-01") or removed entirely? Tombstones are more +user-friendly; removal keeps the status file bounded. Default to +tombstones since the status file bound is only a problem at a scale +the 0.1 release won't hit. + +### neuropose doctor preflight + +**Status:** Not implemented. + +**Why it matters for release:** pydantic-settings validates the +*schema* of `Settings` (is `device` a valid string, is +`poll_interval_seconds` positive). It does not validate the +*environment* — is `data_dir` writable, is the lock file acquirable, +is `model_cache_dir` on the same filesystem as `data_dir` (so +`os.rename` works atomically), is the configured TF device actually +available. Each of those is a runtime failure mode that shows up with +an ugly traceback ten seconds after `neuropose watch` starts, and +every one is cheaply detectable at startup. 
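+
+To make "cheaply detectable" concrete, a sketch of two such probes
+(function names hypothetical; the full check list is scoped below):
+
+```python
+import os
+import tempfile
+
+def check_same_filesystem(model_cache_dir: str, data_dir: str) -> bool:
+    """os.rename() is only atomic within a single filesystem."""
+    return os.stat(model_cache_dir).st_dev == os.stat(data_dir).st_dev
+
+def check_writable(directory: str) -> bool:
+    """Prove writability by actually writing, not by reading mode bits."""
+    try:
+        with tempfile.NamedTemporaryFile(dir=directory):
+            return True
+    except OSError:
+        return False
+```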
+ +**Scope:** + +- New subcommand `neuropose doctor` that runs a battery of + preflight checks and prints a pass/fail table. +- Checks to include: `data_dir` exists and is writable; lock file + acquirable (with clean release); all three subdirectories + (`in/out/failed`) writable; `model_cache_dir` writable and on the + same filesystem as `data_dir`; TF is importable; configured + `device` is in `tf.config.list_physical_devices()`; + `tensorflow-metal` either absent or installed with a version that + advertises support for the installed TF; XDG envvars are sane; + Python version matches `pyproject.toml` floor. +- Exit code 0 if all checks pass, 1 if any warning, 2 if any fatal + failure. +- The daemon's `run()` entry point calls the same underlying + preflight function before entering the poll loop, so + `watch`-without-doctor still gets the benefit. + +**Non-scope:** + +- Do not check for network access to the MeTRAbs download host. + Network-dependent checks make CI flaky and don't match the offline + caching behavior of real operators. + +### Process supervision artifacts + +**Status:** `docs/deployment.md` documents a systemd user unit as +text in prose. No file in `scripts/` that a user can actually copy. +No macOS launchd plist at all. + +**Why it matters for release:** copy-paste from a docs page into a +`.service` file works, but it's friction. An open-source project with +"here is the file, here is where it goes, here is the enable command" +ships deployments faster. + +**Scope:** + +- Ship `scripts/systemd/neuropose.service` as a file with `%h` + placeholders and a short install README. +- Ship `scripts/launchd/org.levineuwirth.neuropose.plist` as a file + with an install README. (Consider making the plist label match the + reverse-DNS of whoever is hosting — either the lab's or + `org.neuropose.daemon` for a vendor-neutral identity.) +- Optional: a `scripts/install_service.sh` that detects the platform + and runs the right install command. Probably not worth the + complexity; a five-line README section per platform is fine. + +**Non-scope:** + +- Do not write installers for init systems we do not personally run + (upstart, sysvinit, runit). If someone needs those, the systemd + unit gives them enough of a template. + +### Structured logging option + +**Status:** Everything logs to stderr via `logging.basicConfig` +with a human-readable formatter. + +**Why it matters for release:** the current format is correct for +interactive use. For any consumer that wants to feed the daemon's +output into Loki, Splunk, Grafana, Datadog, or even `jq`-based +aggregation, JSON-per-line would eliminate a parsing step. This is +a near-free feature if added now and a disruptive formatting change +if added later. It is also a prerequisite for any Track 2 +audit-logging work, so building it now keeps Track 2 options open at +near-zero cost. + +**Scope:** + +- Add a `--log-format={human,json}` global CLI option defaulting to + `human`. +- Implement the `json` variant as a formatter that emits + `{"ts": ..., "level": ..., "logger": ..., "message": ..., ...}` per + line with no log-line wrapping. +- Wire it through `_configure_logging()` so every subcommand benefits + identically. + +**Open question:** do we also want log correlation IDs per job? +That's a bigger change (pushing a context var through the +Interfacer's call stack) and probably Track 2 — skip for 0.1. + +### Monitor authentication + +**Status:** The monitor binds to `127.0.0.1:8765` by default. No +auth, no tokens. 
`--host 0.0.0.0` works but has a comment warning the
+operator to think.
+
+**Why it matters for release:** loopback-only is a reasonable
+default, but the monitor is specifically marketed as the thing
+collaborators can watch. "Collaborator" implies a browser somewhere
+other than the daemon host. The "correct" answer (TLS, real auth) is
+too expensive for 0.1; the "wrong but acceptable" answer (no auth, so
+anyone who can reach the port sees everything) is what we have now.
+There's a middle ground.
+
+**Scope:**
+
+- Add an optional `monitor_token: str | None = None` setting.
+- When set, every request to `/` and `/status.json` must carry
+  `?token=<token>` in the query string or `X-Status-Token` in the
+  header. No token → 401.
+- `neuropose serve` prints a URL including the token on startup, so
+  operators can copy-paste it. If `monitor_token` is unset, behavior
+  is unchanged.
+- `--host 0.0.0.0` emits a stderr warning if `monitor_token` is unset
+  — don't block it, just flag it.
+
+**Non-scope:**
+
+- TLS. Use a reverse proxy (Caddy, nginx, `ssh -L`) for any
+  internet-facing exposure. The monitor is not the right place to
+  terminate TLS.
+- Multi-user auth, session cookies, anything with a database. That's
+  Track 2.
+
+### Docker GPU image
+
+**Status:** `Dockerfile` exists (CPU-only). `Dockerfile.gpu` is
+mentioned in the CHANGELOG as planned.
+
+**Why it matters for release:** a single-file CUDA deployment story
+reduces "can I run this on our lab server?" from a 45-minute dance
+with conda and CUDA versions to one `docker run`. For Linux GPU
+users this is the friction difference between trying the project and
+bouncing.
+
+**Scope:**
+
+- Write `Dockerfile.gpu` on top of `nvidia/cuda:12.x-runtime-ubuntu22.04`
+  (pick the version TF 2.18 actually supports — check the
+  TensorFlow/CUDA tested-build compatibility table, not just
+  "latest").
+- Multi-stage: build stage has `uv` and builds the venv; final stage
+  just copies the venv and sets entrypoints.
+- Add a `docker-build.yml` CI workflow that builds both images on
+  every push to main and publishes them as
+  `ghcr.io/neuwirth/neuropose:cpu` and `:gpu` (or wherever the
+  project ends up hosted).
+- Document in `docs/deployment.md` with a `docker run --gpus all`
+  example.
+
+**Non-scope:**
+
+- A `tensorflow-metal` Docker image. Mac can't virtualize Metal, so
+  there's no point.
+
+### Dependency freshness automation
+
+**Status:** No Dependabot, no Renovate. Everything floats until
+somebody notices. The recent TF cap tightening (`<2.19`) was caught
+manually because a user happened to ask; a scheduled bot would have
+flagged it weeks earlier.
+
+**Why it matters for release:** security CVEs on transitive
+dependencies land every few weeks. Without automation, they get
+discovered by a downstream user trying to install into an audited
+environment. With automation, they become a PR you either merge or
+explicitly decline.
+
+**Scope:**
+
+- Add `.github/dependabot.yml` with groups: `python-prod`,
+  `python-dev`, `github-actions`. Weekly schedule. Ignore `tensorflow`
+  updates until manually cleared (the `tensorflow-metal` constraint
+  means auto-bumping TF is destructive).
+- Alternative: Renovate via `renovate.json`. Renovate has better
+  grouping and scheduling; Dependabot is simpler and needs no setup
+  on GitHub. For an open-source Brown-lab project, Dependabot is
+  enough.
+- Add `uv lock --upgrade-package <package>` to the dev playbook in
+  `docs/development.md` so PR authors know how to re-lock.
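+
+As a starting point, a sketch of the Dependabot config the first
+bullet describes. Treat the `groups` stanzas as approximate and
+verify against GitHub's current `dependabot.yml` reference before
+merging:
+
+```yaml
+# .github/dependabot.yml (sketch; schema details unverified)
+version: 2
+updates:
+  - package-ecosystem: "pip"
+    directory: "/"
+    schedule:
+      interval: "weekly"
+    groups:
+      python-prod:
+        dependency-type: "production"
+      python-dev:
+        dependency-type: "development"
+    ignore:
+      - dependency-name: "tensorflow*"  # manual clearance only
+  - package-ecosystem: "github-actions"
+    directory: "/"
+    schedule:
+      interval: "weekly"
+```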
+ +### Release workflow + +**Status:** `[project.scripts]` is wired for `pip install`, but no +tag-triggered publishing pipeline. `.github/workflows/docs.yml` +uploads the built docs as a 14-day artifact, not to Pages. + +**Why it matters for release:** "release" without a repeatable +publishing flow is a synonym for "one-off person runs hatch build on +their laptop at 11pm before the paper deadline." That is not a +release. + +**Scope:** + +- `.github/workflows/release.yml` triggered on version tags + (`v[0-9]+.[0-9]+.[0-9]+`). Steps: check version matches + `__version__`; build with `hatch build`; publish to PyPI via + trusted publisher (no long-lived token); create GitHub release with + changelog excerpt. +- Flip `docs.yml` to deploy the `site/` output to GitHub Pages on + every push to `main` once the repo is public. Pin the Pages URL in + the README and in `site_url` in `mkdocs.yml` (already points at + `levineuwirth.github.io`, but verify). +- Sign tags with GPG; document the key fingerprint in `SECURITY.md` + (which does not yet exist; create it). +- Consider wiring sigstore signing at the same time — see Track 2 + supply-chain section. Free after the initial setup and buys + everything Track 2 would want without committing to the rest of + that track. + +**Open question:** do we publish under `neuropose`, `brown-neuropose`, +or something else on PyPI? Whichever name, squat it before the paper +drops — waiting means risking namesquatter abuse. + +### Error-path test coverage expansion + +**Status:** Happy paths and a handful of input-validation errors +covered. Not covered: disk full mid-write, corrupt video mid-decode, +OOM during inference, fcntl.flock on NFS (no-op on some kernels), +truncated zip archives, permission denied on data_dir subdirectories. + +**Why it matters for release:** shipping a tool where "happy path +works" is different from shipping a tool where "when it fails, it +fails predictably." For a clinical research pipeline where a crash +mid-job quarantines valuable recording data, fault tolerance is a +feature. + +**Scope:** + +- Systematic pass: for each module, write a `test__failure_modes.py` + enumerating the specific exception classes that can escape and the + corresponding test case that triggers each one. Use `pytest.raises` + with the exact expected exception class. +- Hardest cases use fixtures that monkeypatch system calls + (`os.write` raises OSError(ENOSPC), `cv2.VideoCapture.read` returns + `False, None` partway through, `fcntl.flock` raises OSError(EBADF)). +- Aim: every user-facing error message in the codebase has a test + that proves it's reachable. + +**Non-scope:** + +- Chaos-engineering frameworks. `monkeypatch` is enough. +- Covering unrecoverable errors like SIGKILL of the daemon mid-frame. + That's the recovery-on-startup test, which already exists. + +--- + +## Track 2: Clinical platform (contingent) + +Track 2 is everything beyond the open-source research tool — +multi-tenancy, audit logging, HTTP/API layer, clinician UI, +clinical-system integrations, the works. None of it is sequenced +with Phases 0–2; all of it is gated on specific triggers that don't +exist yet. + +### Triggers to activate Track 2 + +Do not start Track 2 work until at least one of the following is +true: + +1. **External demand.** Another clinical group has asked for a + deployment they can run independently. Not a casual "interesting + project" — a specific ask with a specific cohort and a specific + timeline. +2. 
**Multi-site ambition.** The Shu Lab decides to run NeuroPose
+   across more than one site within Brown-affiliated clinical
+   systems, and the single-host assumption stops working.
+3. **Funding mandate.** A grant or contract specifies outputs that
+   the Phase 0-1-2 deliverables cannot meet (e.g. "produce a
+   HIPAA-compliant deployment," "integrate with the EHR").
+4. **Publication traction.** Papers A and C get engagement that
+   translates into demand for a hosted version. Clinical-methods
+   papers occasionally do. If enough unsolicited inquiries land,
+   Track 2 becomes worth the investment.
+
+Until at least one of these triggers fires, everything below is
+background thinking, not planned work. *Do not refactor Phase 0 or
+Phase 2 code to make Track 2 easier.* Every such refactor is a bet
+on a future that may not arrive.
+
+### Multi-tenancy and identity
+
+**What it would require:**
+
+- A concept of "user" distinct from "OS user." Today `Settings.data_dir`
+  is one directory per OS user; multi-tenancy means one `data_dir`
+  serving many logical tenants with enforced isolation.
+- Per-tenant namespacing in `in/`, `out/`, `failed/`, and
+  `status.json`. Cleanest is one subdirectory per tenant with the
+  same four-directory layout; the daemon's discovery logic becomes a
+  two-level scan.
+- Authentication on the control plane. Passing tenant identity as a
+  command-line arg is fine for a research prototype; a real
+  deployment needs OAuth/OIDC or SAML with the institution's IdP
+  (Brown CAS, Epic auth, whatever the target site uses).
+- Authorization model: at minimum, "tenant A cannot see tenant B's
+  jobs." For clinical deployments, probably also role-based (clinician
+  / PI / admin / auditor).
+
+**Cheapest path forward if a trigger fires:** fork the data-directory
+layout into `$data_dir/<tenant>/{in,out,failed,status.json}`,
+teach the daemon to iterate tenants in its poll loop, add a
+`--tenant` flag to the CLI. That's enough for an invitation-only
+deployment where tenants are identified by opaque string and issued
+out-of-band.
+
+**Expensive path:** anything involving an identity provider. Don't
+go there without a real operator committing to the deployment.
+
+### Audit logging and compliance posture
+
+**What it would require:**
+
+- Append-only log of every data access, write, and configuration
+  change, with actor identity and timestamp. Separate from the
+  application log (which rotates).
+- Logs streamed to a write-once sink (S3 with object-lock,
+  immutable journal) so a compromised host can't rewrite the
+  trail.
+- Legal review: what exactly does HIPAA require of this tool? What
+  about institutional IRB? The answer will differ across sites and
+  the project cannot prescribe it — but the *capability* to generate
+  the required logs needs to be built in.
+- Retention policy wired to the audit log, not just application
+  state. Pruning job results is different from pruning audit records.
+
+**Technical prerequisite:** structured logging from Phase 2 (which
+is a cheap add and is scheduled anyway). Without JSON-per-line logs,
+audit extraction is a grep-and-pray regex problem.
+
+### HTTP/API layer
+
+**What it would require:**
+
+- Today the control plane is "write files to `in/`." For a
+  non-filesystem-native consumer (a hosted web UI, a batch scheduler,
+  a Jupyter kernel in a different container), an HTTP API is the
+  right abstraction.
+- FastAPI or Litestar on top of the existing ingest/interfacer/io
+  modules. The daemon becomes a long-running process that serves
+  requests *and* processes the input directory; or the daemon stays
+  headless and the HTTP layer is a separate process talking via the
+  same filesystem contract.
+- OpenAPI schema published as part of the release so client code can
+  be generated.
+
+**Non-obvious pitfall:** the daemon's fcntl-based single-instance
+lock assumes one writer. If the HTTP layer is a separate process, it
+needs to go through the same ingest API, not directly into `in/`.
+That's an easy discipline to establish if designed in from day one,
+a painful refactor later.
+
+**Cheap Phase 0/2 precaution:** keep `neuropose.ingest` and
+`neuropose.interfacer` API-stable as Python modules. If a future
+HTTP layer imports them, we don't want to break the import.
+
+### Clinician-facing UI
+
+**What it would require:**
+
+- More than the `neuropose serve` dashboard — an actual web
+  application with clinician-facing views: patient list, session
+  list, session-level pose visualization, comparison against
+  reference motion, exportable reports.
+- Probably React + TypeScript on the frontend, consuming the HTTP
+  API above. Backend-rendered templates would be faster to build but
+  a worse fit for the per-session interaction model clinicians
+  expect.
+- WebGL or Three.js for 3D pose playback. The `neuropose.visualize`
+  module is a matplotlib-based still-frame tool; rebuilding it for
+  interactive 3D is a weeks-to-months project on its own.
+- Accessibility: clinician environments include keyboard-only users,
+  users on institutional IE holdovers (yes, still), users with
+  screen readers. A research-grade UI ignores this; a clinical-grade
+  one cannot.
+
+**Scope is enormous.** This is the single largest piece of Track 2
+and would likely dwarf all other Track 2 work combined. Would not
+start without dedicated frontend engineering effort.
+
+### Horizontal scaling
+
+**What it would require:**
+
+- A message broker (Redis Streams, RabbitMQ, or NATS) in place of the
+  filesystem poll. Each job becomes a broker message; multiple
+  worker processes consume and process in parallel.
+- Shared storage for inputs and outputs (S3, MinIO, NFS). The
+  "job_name is a directory" contract generalizes to "job_name is an
+  object prefix."
+- Per-worker GPU affinity for the multi-GPU case; worker auto-sizing
+  based on queue depth.
+- Distributed lock for the leader-only work (status file writes,
+  retention enforcement).
+
+**Upgrade path that minimizes pain:** the current single-process
+daemon is equivalent to the "one worker" case of a horizontal
+deployment. If the job object in `neuropose.io` stays the source of
+truth (not the filesystem layout), the transition is backend-swap,
+not architectural surgery. Keep that option open by treating the
+filesystem as an implementation detail of `Interfacer`, not a public
+contract.
+
+### Backup, replication, and data durability
+
+**What it would require:**
+
+- Outputs (`out/<job_name>/results.json`) currently live on one disk
+  on one host. For clinical data this is insufficient durability.
+- Replication target: another host (hot standby), object storage
+  (warm archive), or both. The `out/` directory is the canonical
+  store; replicating it periodically is a scriptable cron job today.
+- Proper replication: as writes happen, not as a cron. Either a
+  daemon-side hook that PUTs to S3 immediately after each
+  `save_job_results`, or a sidecar process watching the filesystem
+  with `inotify`/`fswatch`.
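+
+  A sketch of the hook variant (names hypothetical; assumes `boto3`
+  and that the daemon calls this right after `save_job_results`
+  returns):
+
+  ```python
+  import pathlib
+  import boto3  # assumed dependency for the S3 variant
+
+  def replicate_results(results_path: pathlib.Path,
+                        data_dir: pathlib.Path, bucket: str) -> None:
+      """Mirror $data_dir/out/<job_name>/results.json into S3, same layout."""
+      key = str(results_path.relative_to(data_dir))  # out/<job_name>/...
+      boto3.client("s3").upload_file(str(results_path), bucket, key)
+  ```
+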
+- Restore story: how do we restore `out/` from backup without
+  breaking `status.json` (which refers to job names by convention)?
+  Test this annually.
+
+**Minimum viable backup for Phase 2:** add a `scripts/backup.sh`
+that rsyncs `$data_dir/out/` to a configurable destination. Not a
+feature; a paving-the-path-for-operators artifact.
+
+### Clinical-system integrations
+
+**What it would require:**
+
+- **DICOM** if videos are stored as DICOM instances rather than
+  MP4. Clinical motion-analysis devices increasingly output DICOM
+  video; reading DICOM means `pydicom` + some decoding logic.
+- **FHIR** for patient metadata. If NeuroPose is to accept a
+  patient ID and attach it to a session, that ID probably comes
+  from a FHIR Patient resource. Means speaking FHIR to the hospital's
+  FHIR endpoint (Epic, Cerner).
+- **REDCap** integration for clinical-research cohorts (the Brown
+  ecosystem uses it heavily). An export script that pulls subject
+  metadata from a REDCap project and lays it into the ingest
+  directory is cheap and valuable.
+
+**Order of likely need:** REDCap first (easy, valuable, Brown-local),
+then DICOM (depends on what the recording device outputs), then
+FHIR (only if we're pulling from an EHR, which we probably aren't
+for research).
+
+### Deterministic inference mode
+
+**What it would require:**
+
+- Phase 0's `Provenance` object already captures model SHA, TF
+  version, NumPy version, and a seed field. The missing piece for
+  strict reproducibility is forcing TensorFlow itself to behave
+  deterministically —
+  `tf.config.experimental.enable_op_determinism()` plus seeding all
+  of `random`, `numpy.random`, and `tf.random`.
+- A `deterministic: bool = False` setting on `Settings` that flips
+  the above. Default off, because deterministic mode costs a
+  meaningful fraction of throughput on GPUs and isn't free on CPUs
+  either. Clinical deployments would turn it on; benchmark runs
+  would turn it off.
+- A `Provenance.deterministic` boolean field is already in the Phase
+  0 scope; this item closes the loop by actually making that
+  boolean mean something.
+
+**Cheap Phase 2 precaution:** wire the setting in Phase 2 even if we
+don't flip it on by default. Future Track 2 deployments can flip it
+without a code change.
+
+### Observability and SLOs
+
+**What it would require:**
+
+- Prometheus metrics endpoint (separate port from the monitor, no
+  auth needed on metrics, loopback or behind a scraper only).
+- Counters: jobs_processed, jobs_failed, frames_processed, bytes_read,
+  bytes_written. Histograms: per-frame latency, per-job latency,
+  per-video latency. Gauges: queue depth, active job count.
+- Tracing: OpenTelemetry instrumentation on job_process,
+  detect_poses, save_job_results. Again, the interesting spans are
+  the long ones, so trace-sampling at 100% is usually fine until
+  throughput matters.
+- Defined SLOs: "99% of jobs complete within 10× video duration,"
+  "95% of monitor requests return in under 100 ms," etc.
+  SLO definitions go into a `docs/slos.md`; burn-rate alerting is
+  the operational half.
+
+**Order-of-magnitude dependency:** none of this is useful without
+Track 2 demand. A single-user research Mac doesn't have SLOs.
+
+### Supply-chain attestation and signed releases
+
+**What it would require:**
+
+- SBOM generation on every release (CycloneDX or SPDX format,
+  attached to the GitHub release and published alongside the wheel).
+- Signed releases: sigstore / cosign signatures on the wheel, the
+  Docker images, and the source tarball. GitHub's OIDC +
+  sigstore makes this a ten-line workflow once it's set up. For a
+  clinical tool, a reviewer being able to verify "this wheel is the
+  one GitHub Actions produced from this commit" is non-negotiable.
+- Reproducible builds: same source → same wheel hash. Python wheels
+  are usually reproducible with `SOURCE_DATE_EPOCH` set and `.pyc`
+  exclusion; document the exact command.
+- Provenance attestations (SLSA level 2 or 3) for the CI pipeline.
+  GitHub's `actions/attest-build-provenance` action does this.
+
+**Cheapest Phase 2 precaution:** wire sigstore signing into the
+release workflow when it's first built (see Phase 2 release workflow
+section). Free after the initial setup.
+
+### Deployment orchestration
+
+**What it would require:**
+
+- Kubernetes manifests (Helm chart, probably). Pod specs for the
+  daemon, the monitor, the HTTP API. Separate deployments so they
+  can scale independently.
+- Terraform or Pulumi for the underlying infrastructure: GPU
+  node pool, object storage, IAM, TLS termination. Site-dependent;
+  Brown runs primarily on-prem with some AWS — the IaC would need
+  to target both.
+- Secrets management: Vault, AWS Secrets Manager, or K8s
+  Secrets + External Secrets Operator. The monitor token, the
+  broker credentials, the object-storage keys all need to stop being
+  env vars in a `.service` file.
+
+**Strong recommendation:** do not write any of this until there is
+a specific deployment with specific operators. Generic K8s manifests
+written without a target are a solution in search of a problem, and
+they age fast.
+
+---
+
+## Decisions to not prematurely foreclose
+
+A short list of choices we should avoid making in Phase 0 or Phase 2
+that would make Track 2 more expensive later:
+
+1. **Keep `neuropose.ingest` and `neuropose.interfacer` API-stable
+   as Python modules.** A future HTTP layer should be able to import
+   them. Avoid adding `@staticmethod` decorators that hide internal
+   state; avoid coupling to global config.
+2. **Keep the filesystem layout reversible.** Anything in
+   `$data_dir` that is not a user artifact should be treated as
+   internal. If Track 2 wants to replace the filesystem with an
+   object store, the daemon's only file I/O should be via
+   `neuropose.io` helpers — no raw opens scattered through the code.
+3. **Keep `VideoPredictions.provenance` extensible.** The Phase 0
+   `Provenance` model should be a pydantic model so fields can be
+   added backward-compatibly. Don't pack provenance into free-form
+   strings or nested dicts that require bespoke parsing.
+4. **Keep the CLI subcommands orthogonal.** Do not add subcommands
+   that wrap multiple other subcommands for convenience; that
+   creates API shape we'd regret if the right composition layer
+   later is HTTP, not shell.
+5. **Keep model loading behind `neuropose._model`.** A future
+   self-hosted model registry, signed-artifact verification, or
+   multi-model switching should be a change in one file, not a
+   refactor across the estimator.
+6. **Keep `Settings` the single source of truth.** No `os.environ`
+   reads outside pydantic-settings; no sprinkled `Path.home()`
+   calls. Track 2 almost certainly overrides configuration from
+   a secret store, and if that override has one place to hook in,
+   it's easy.
+7. **Keep status-file schema owned by pydantic, not hand-written
+   JSON.** Track 2 multi-tenancy means indexing into the status
+   file by tenant; a pydantic model refactor is cheap, a
+   hand-written dict refactor is not.
+8.
**Keep the `AnalysisConfig` shape additive.** The Phase 0 YAML + schema will evolve through Phase 1 as Paper C's experiments + surface needs. Additions are free (new optional fields); + renames and removals invalidate prior experiments. Pydantic's + `extra="forbid"` catches typos at parse time while still + allowing additive extension. + +These are cheap-now / expensive-later items. Every other Track 2 +decision can wait for a real trigger.
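+
+A closing sketch for item 8, since it is the constraint Phase 0 has
+to get right on day one (field names illustrative, not the real
+`AnalysisConfig`):
+
+```python
+from pydantic import BaseModel, ConfigDict
+
+class AnalysisConfig(BaseModel):
+    model_config = ConfigDict(extra="forbid")  # typos fail at parse time
+
+    predictions: str               # v1 field
+    align: str = "none"            # v1 field with a default
+    nan_policy: str | None = None  # added later; optional, so v1 YAMLs
+                                   # still validate unchanged
+```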