diff --git a/.gitignore b/.gitignore
index 0c58b27..216e017 100644
--- a/.gitignore
+++ b/.gitignore
@@ -70,6 +70,16 @@ Thumbs.db
# --- Docs site build -------------------------------------------------------
site/
+# --- Ideation / lab-notebook docs ------------------------------------------
+# Living R&D notes and engineering roadmaps. Kept locally so they can
+# evolve freely with in-progress thinking, pre-meeting drafts, and
+# speculative directions that don't belong in the public repo. Anything
+# under docs/research/ is treated the same way — personal / lab-internal
+# working artifacts, not published docs.
+/RESEARCH.md
+/TECHNICAL.md
+/docs/research/
+
# --- Data and model weights (policy-enforced) ------------------------------
# Runtime job directories, subject data, and downloaded model caches must
# never be committed. The default runtime location is under $XDG_DATA_HOME
diff --git a/RESEARCH.md b/RESEARCH.md
deleted file mode 100644
index c040c4c..0000000
--- a/RESEARCH.md
+++ /dev/null
@@ -1,784 +0,0 @@
-# NeuroPose Research and Ideation Notes
-
-A living R&D log for open design questions, speculative directions, and
-planned experiments that are larger in scope than individual commits.
-This is **not** user-facing documentation — items in here are
-*candidates* for future work, and inclusion does not imply commitment.
-
-## How to use this document
-
-- Add a section when you start thinking about a new area of investigation.
-- Each section should end with an **Open questions** or **Next steps**
- block so it's obvious to a future you (or a new contributor) what the
- active threads are.
-- When something in here is decided and implemented, move it to the
- relevant place in `docs/` or in the code itself and leave a short
- pointer behind ("*See `docs/architecture.md` for the resolved design.*").
-- Consider the audience: yourself, Dr. Shu, David, Praneeth, and future
- contributors. Assume they know pose estimation at a grad-student level
- but may not have followed every prior conversation.
-
-## Contents
-
-- [DTW methodology](#dtw-methodology)
-- [TensorFlow version compatibility](#tensorflow-version-compatibility)
-- [MeTRAbs hosting and extensibility](#metrabs-hosting-and-extensibility)
-
----
-
-## DTW methodology
-
-### Current implementation (v0.1, commit 10)
-
-`neuropose.analyzer.dtw` ships three entry points, all built on top of
-[`fastdtw`](https://github.com/slaypni/fastdtw) with
-`scipy.spatial.distance.euclidean` as the point-distance function:
-
-- **`dtw_all(a, b)`** — single DTW on flattened `(frames, joints × 3)`
- vectors. One scalar distance for the whole sequence.
-- **`dtw_per_joint(a, b)`** — one DTW call per joint, returning a list
- of per-joint distances and warping paths. Preserves per-joint
- temporal alignment at J× the cost.
-- **`dtw_relation(a, b, joint_i, joint_j)`** — DTW on the per-frame
- displacement vector between two specific joints. The intent here is
- to capture "how does the relationship between these two joints change
- over time", which is translation-invariant and so immune to raw
- camera-frame changes.
-
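-For orientation, a minimal sketch of the `dtw_all` shape, assuming
-`(frames, joints, 3)` float arrays (the real `neuropose.analyzer.dtw`
-code differs in validation and typing):
-
-```python
-import numpy as np
-from fastdtw import fastdtw
-from scipy.spatial.distance import euclidean
-
-def dtw_all_sketch(a: np.ndarray, b: np.ndarray) -> float:
-    """One scalar DTW distance over flattened (frames, joints * 3) rows."""
-    a2 = a.reshape(len(a), -1)  # (frames, joints, 3) -> (frames, joints * 3)
-    b2 = b.reshape(len(b), -1)
-    distance, _path = fastdtw(a2, b2, dist=euclidean)
-    return float(distance)
-```
-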
-These three correspond directly to the three helpers that existed
-(broken) in the previous prototype's `analyzer.py`, ported forward with
-bug fixes, types, and tests. **The port was mechanical — not a
-methodological choice.** We inherited the FastDTW + Euclidean defaults
-without validating them against the clinical research use cases, and
-that validation is overdue.
-
-### Known limitations of the v0.1 approach
-
-#### FastDTW is an approximation, not exact DTW
-
-[FastDTW](https://cs.fit.edu/~pkc/papers/tdm04.pdf) is a multi-scale
-approximation that runs in linear time by recursively refining a coarse
-alignment. For the radius-based implementation in
-`slaypni/fastdtw`, the distance is not guaranteed to match exact DTW,
-and in pathological cases the error can be significant. For a research
-codebase where the DTW distance is going to show up in a figure, that
-matters.
-
-**Candidate exact alternatives** (all pip-installable):
-
-- [`dtaidistance`](https://github.com/wannesm/dtaidistance) — C-based,
- supports both exact DTW and a `fast=True` approximation; also
- supports shape-DTW and various constraint bands. Actively maintained,
- and the underlying algorithms match the textbook.
-- [`tslearn`](https://tslearn.readthedocs.io/) — ML-flavored toolkit
- with exact DTW, soft-DTW (differentiable), Sakoe-Chiba banding, and
- kernel-DTW. Good fit if we ever want to feed DTW distances into an
- sklearn/PyTorch pipeline.
-- [`cdtw`](https://github.com/statefb/dtw-python) / `dtw-python` —
- Python port of the R `dtw` package; exhaustive options for windowing,
- step patterns, and open-ended alignment. Less friendly API but the
- most rigorously documented.
-
-#### Euclidean is a choice, not a default
-
-Treating `(x, y, z)` joint positions as a point in R³ and taking
-Euclidean distances implicitly assumes the three axes are commensurable
-in the same units, which is fine for MeTRAbs (mm) but throws away prior
-knowledge about human motion. Alternatives worth considering:
-
-- **Angular distance on joint angles.** Compute joint angles per frame
- (`extract_joint_angles` already exists) and run DTW on the angle
- sequences rather than raw coordinates. Translation- and
- scale-invariant by construction; well-matched to clinical metrics
-  like knee flexion angle; the angle computation is sketched below.
-- **Geodesic distance on SO(3)** for local joint rotations. Requires a
- skeleton-rooted rotation parameterization; more work to set up but
- the right metric for "how different are these two poses?" in a
- biomechanics sense.
-- **Mahalanobis distance** against a learned pose prior. This is the
- "machine learning" answer — fit a covariance to a reference corpus
- (normal gait from a healthy cohort), then measure distances in the
- whitened space. Requires enough data to fit the prior without
- overfitting, but makes "is this gait abnormal?" a calibrated question.
-
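-The angular option is the cheapest to prototype. For concreteness, a
-sketch of the per-frame quantity involved, the angle at a joint between
-its two incident bones (assumed semantics; the project's actual
-implementation is `extract_joint_angles`):
-
-```python
-import numpy as np
-
-def joint_angle(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
-    """Angle in degrees at joint b, between bone b->a and bone b->c."""
-    u, v = a - b, c - b
-    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
-    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
-```
-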
-#### Preprocessing: what invariance do we want?
-
-The v0.1 implementation is not invariant to anything. Two videos of the
-same subject with a different camera position will give a different
-DTW distance, which is almost certainly not what a clinician wants.
-Candidate preprocessing steps:
-
-- **Translation invariance**: subtract the root joint (pelvis or torso
- centroid) from every joint per frame, so all poses are expressed in a
- body-relative coordinate frame. Cheap and almost always desired.
-- **Scale invariance**: divide by a reference length (e.g., torso
- length, or total skeleton span) so tall and short subjects produce
- comparable distances. Important for comparing across subjects.
-- **Rotation invariance**: align to a canonical frame (e.g., hip-to-hip
- vector = x-axis, hip-to-shoulder = z-axis) per frame. Required if the
- subject's orientation relative to the camera varies between trials.
-- **Procrustes alignment per frame**: fit the best rigid transform
- (rotation + translation) between pose A's frame and pose B's frame
- before computing distance. The closed-form
- [Kabsch algorithm](https://en.wikipedia.org/wiki/Kabsch_algorithm) is
- fast and exact. This is likely the *right* thing for most comparison
-  use cases but has never been wired up; see the sketch below.
-
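-A minimal sketch of the per-frame Kabsch step, assuming `(joints, 3)`
-pose arrays (hypothetical helper, not an existing `neuropose` function):
-
-```python
-import numpy as np
-
-def kabsch_align(p: np.ndarray, q: np.ndarray) -> np.ndarray:
-    """Rigidly align pose p onto pose q; both are (joints, 3) arrays."""
-    pc, qc = p - p.mean(axis=0), q - q.mean(axis=0)  # remove translation
-    h = pc.T @ qc                                    # 3x3 cross-covariance
-    u, _s, vt = np.linalg.svd(h)
-    d = np.sign(np.linalg.det(vt.T @ u.T))           # guard against reflection
-    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
-    return pc @ r.T + q.mean(axis=0)
-```
-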
-The `dtw_relation` helper is translation- and (for unit-vector
-displacements) scale-invariant by construction, which is why it ends up
-being the most useful of the three existing entry points in practice.
-
-#### Representation: coordinates, angles, velocities, or dual?
-
-The v0.1 DTW operates on **3D joint coordinates** (translation-dependent)
-or **joint-pair displacements** (`dtw_relation`). Other representations
-worth comparing:
-
-- **Joint angles.** Using `extract_joint_angles` output as the DTW
- input gives a rotation-and-translation-invariant comparison that's
- also directly interpretable in clinical terms.
-- **Joint velocities.** Temporal derivatives of position. Emphasizes
- *how the pose changes* rather than *what it is* — good for
- discriminating smooth from jerky motion in gait.
-- **Dual (position + angle).** Concatenate normalized position and
- angle features into a single per-frame vector. More expressive but
- requires tuning the relative weights.
-- **Learned embeddings.** Feed each frame through a pretrained
- pose-representation network (there are a few) and DTW on the
- embedding space. Expensive and opaque but may capture
- higher-order structure.
-
-#### Multi-scale approaches
-
-FastDTW is already multi-scale internally. Other ideas:
-
-- **Coarse-to-fine DTW.** Downsample aggressively, run exact DTW on
- the coarse version to get a sub-quadratic alignment, then refine
- locally. This is essentially what FastDTW does, but with an explicit
- signal-processing hat on.
-- **Wavelet-decomposed DTW.** Decompose each joint's trajectory into
- wavelet coefficients and run DTW on the low-frequency coefficients.
- Unclear whether this actually helps; interesting because it separates
- posture (low-frequency) from tremor / micro-motion (high-frequency).
-
-#### Clinical gait: cycle-aware DTW
-
-Gait is approximately periodic, and "the 4th heel-strike of trial A"
-is the clinically meaningful comparison point to "the 4th heel-strike
-of trial B", not "frame 120 of A vs frame 120 of B". A natural two-stage
-approach:
-
-1. **Cycle detection.** Find heel-strikes (or other gait events) via
- peak detection on a joint's vertical coordinate, and segment each
- trial into individual cycles.
-2. **Per-cycle DTW.** Time-warp within each cycle independently to
- normalize cycle duration. The distance between trials is then the
- sum / mean of per-cycle distances.
-
-This is standard in the biomechanics literature
-([Sadeghi et al. 2000](https://doi.org/10.1016/S0966-6362(00)00074-3)
-and descendants) and is almost certainly a better fit for clinical
-comparison than the naive full-trial DTW we ship today.
-
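-Stage 1 is cheap to prototype with `scipy.signal.find_peaks`; a sketch
-with placeholder thresholds (none of this is existing `neuropose` API):
-
-```python
-import numpy as np
-from scipy.signal import find_peaks
-
-def segment_cycles(heel_y: np.ndarray, fps: float,
-                   min_cycle_seconds: float = 0.4) -> list[slice]:
-    """Split a trial into gait cycles between successive heel-strikes."""
-    # Heel-strikes appear as minima of the heel's vertical coordinate, so
-    # find peaks on the negated signal, at least one cycle-length apart.
-    strikes, _ = find_peaks(-heel_y, distance=int(min_cycle_seconds * fps))
-    return [slice(s0, s1) for s0, s1 in zip(strikes[:-1], strikes[1:])]
-```
-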
-#### Soft-DTW for learning applications
-
-[Soft-DTW](https://arxiv.org/abs/1703.01541) is a differentiable
-relaxation of DTW, which means gradients can flow through it. This
-matters if we ever want to train a network to *learn* a distance
-metric or an embedding under a DTW objective — for example, a pose
-encoder whose output space is calibrated to gait similarity. Worth
-keeping on the radar even if we're not training anything today.
-`tslearn` implements it.
-
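-The usage shape, assuming `(frames, features)` float arrays (`gamma`
-controls the smoothing; as `gamma` approaches zero, soft-DTW approaches
-classic DTW):
-
-```python
-import numpy as np
-from tslearn.metrics import soft_dtw
-
-a = np.random.rand(120, 43 * 3)  # two sequences of different length
-b = np.random.rand(140, 43 * 3)
-value = soft_dtw(a, b, gamma=1.0)  # differentiable DTW objective value
-```
-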
-### Evaluation strategy
-
-Validating a DTW implementation is harder than validating most things.
-Some ideas for how to know we got it right:
-
-- **Synthetic perturbations.** Take a reference sequence and apply
- known perturbations (time stretch, added noise, spatial offset) and
- verify that distance scales monotonically with perturbation magnitude
-  and that invariance properties are honored; a sketch follows this list.
-- **Reference implementation parity.** For a small set of hand-picked
- pairs, compute DTW distance using `dtaidistance` exact DTW and
- our implementation, and verify the approximation error is below a
- documented threshold.
-- **Inter-rater clinical benchmark.** When we have labeled clinical
- data, measure how well DTW distance correlates with clinician
- ratings of gait similarity. This is the real test but is gated on
- having data we can use.
-- **Pathology discrimination.** Can DTW distance separate healthy
- from impaired gait in a held-out set? This is the usefulness test.
-
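-The synthetic-perturbation check is the easiest to automate; a sketch
-in which `dtw_fn` stands in for whichever entry point is under test:
-
-```python
-import numpy as np
-
-def check_noise_monotonicity(reference: np.ndarray, dtw_fn) -> None:
-    """DTW distance should grow monotonically with added spatial noise."""
-    rng = np.random.default_rng(0)
-    dists = [
-        dtw_fn(reference, reference + rng.normal(0.0, s, reference.shape))
-        for s in (0.0, 1.0, 5.0, 25.0)  # noise sigma in the data's units (mm)
-    ]
-    assert dists == sorted(dists), f"not monotone in noise level: {dists}"
-```
-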
-### Open questions
-
-1. Is FastDTW good enough, or should we move to `dtaidistance` exact
- DTW as the default? (First concrete experiment: pick 20 pairs from
- whatever reference data we can source, compute distance both ways,
- see if the approximation error is acceptable.)
-2. What's the right representation for clinical gait DTW — raw
- coordinates, joint angles, or per-pair displacements?
-3. Should we implement Procrustes alignment as a preprocessing step
- before any DTW call? (If yes, it belongs in `neuropose.analyzer.features`.)
-4. Should the clinical pipeline use cycle-segmented DTW instead of
- full-trial DTW? This is a methodological choice with real
- downstream implications.
-5. Is soft-DTW useful to us, or is it a solution looking for a
- problem we don't have?
-6. What reference corpus do we use to develop and validate any of this?
-
-### Reading list
-
-- Sakoe, H. & Chiba, S. (1978). "Dynamic programming algorithm
- optimization for spoken word recognition." The original DTW paper.
-- Salvador, S. & Chan, P. (2007). "Toward accurate dynamic time
- warping in linear time and space."
- [PDF](https://cs.fit.edu/~pkc/papers/tdm04.pdf). The FastDTW paper.
-- Cuturi, M. & Blondel, M. (2017). "Soft-DTW: a Differentiable Loss
- Function for Time-Series." [arXiv 1703.01541](https://arxiv.org/abs/1703.01541).
-- Sadeghi, H. et al. (2000). "Symmetry and limb dominance in able-bodied
- gait: a review." Biomechanics reference for cycle-aware analysis.
-- `dtaidistance` documentation —
-  <https://dtaidistance.readthedocs.io/>. Worth reading even if we
-  don't switch, for the overview of DTW variants and constraints.
-
-### Next steps
-
-- [ ] Pick 10–20 reference pose-sequence pairs and run both FastDTW and
- exact DTW on them to quantify the approximation error.
-- [ ] Prototype a Procrustes-aligned preprocessing wrapper and
- re-run the same pairs.
-- [ ] Sketch a cycle-aware DTW pipeline against a gait dataset we can
- actually use (identity- and IRB-safe).
-- [ ] Decide whether to keep FastDTW as the default or replace it.
-- [ ] If we replace it: migrate `neuropose.analyzer.dtw` to the new
- backend in a single commit with no API change.
-
----
-
-## TensorFlow version compatibility
-
-### The question
-
-The pinned MeTRAbs model artifact
-(`metrabs_eff2l_y4_384px_800k_28ds.tar.gz`) is a TensorFlow SavedModel.
-SavedModels embed a producer TF version and depend on a set of TF op
-kernels. Picking a TF version pin that is too low risks Apple Silicon
-install pain (pre-2.16 has no native `darwin/arm64` wheel under the
-`tensorflow` package name); picking one that is too high risks loading
-or runtime failures if MeTRAbs uses ops that have been renamed,
-deprecated, or removed. The goal of this investigation was to find the
-**minimum** pin that works on Linux x86_64, Linux arm64, and macOS arm64
-without forcing platform-conditional dependencies or shipping
-`tensorflow-metal` as a default.
-
-### Method
-
-Phase 0 of the procedure laid out earlier in this document was to
-inspect the SavedModel directly and run `detect_poses` end-to-end on a
-synthetic input. The probe script (`test.py` at the repo root, kept
-during the investigation and removed in the same commit that landed the
-pin) did three things:
-
-1. Parsed `saved_model.pb` with `saved_model_pb2.SavedModel` and read
- the `tensorflow_version` and `tensorflow_git_version` fields out of
- each `meta_info_def` to establish the **producer** version.
-2. Walked every `node.op` and `library.function[*].node_def[*].op` in
- the graph to enumerate the **complete set of ops** the model relies
- on. This is the binary-compatibility surface — anything in this set
- that gets removed in a future TF release breaks the model.
-3. Called `tf.saved_model.load(MODEL_DIR)`, accessed
- `per_skeleton_joint_names["berkeley_mhad_43"]`, and invoked
- `model.detect_poses(image, intrinsic_matrix=..., skeleton="berkeley_mhad_43")`
- on a 288×384 black frame to confirm the consumer TF version actually
- *runs* the model (not just loads it — these are different failure
- modes).
-
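-A rough reconstruction of probe steps 1 and 2 (the original `test.py`
-was removed in the commit that landed the pin, so names here are
-illustrative; the proto fields follow the public `saved_model.proto`
-schema):
-
-```python
-from tensorflow.core.protobuf import saved_model_pb2
-
-def probe(model_dir: str) -> None:
-    sm = saved_model_pb2.SavedModel()
-    with open(f"{model_dir}/saved_model.pb", "rb") as f:
-        sm.ParseFromString(f.read())
-    ops: set[str] = set()
-    for mg in sm.meta_graphs:
-        info = mg.meta_info_def
-        print("producer:", info.tensorflow_version, info.tensorflow_git_version)
-        ops.update(n.op for n in mg.graph_def.node)       # top-level graph
-        for fn in mg.graph_def.library.function:          # nested functions
-            ops.update(n.op for n in fn.node_def)
-    print(f"{len(ops)} distinct ops:", ", ".join(sorted(ops)))
-```
-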
-The probe ran on Linux x86_64 against whatever `uv sync --group dev`
-resolved at the time, which was **TensorFlow 2.21.0** with **Keras
-3.14.0** — i.e. the most recent TF release as of 2026-04 and a version
-that crosses the Keras-3 cutover at TF 2.16.
-
-### Result
-
-- **Producer version:** `tf version: 2.10.0`,
- `producer: v2.10.0-0-g359c3cdfc5f`. The model was serialized in
- September 2022, consistent with the file mtimes in the extracted
- tarball.
-- **Custom ops:** **zero**. `tf.raw_ops.__dict__` filtered for
- `"metrabs"` returned `[]`. Every op in the SavedModel is a stock
- TensorFlow kernel that has been stable since at least TF 2.4.
-- **Op inventory** (recorded for posterity so a future contributor can
- diff against a newer MeTRAbs release without re-running the probe):
-
- ```
- Abs, Add, AddV2, All, Any, Assert, AssignVariableOp, AvgPool,
- BatchMatMulV2, BiasAdd, Bitcast, BroadcastArgs, BroadcastTo, Cast,
- Ceil, Cholesky, CombinedNonMaxSuppression, ConcatV2, Const, Conv2D,
- Cos, Cross, Cumsum, DepthwiseConv2dNative, Einsum, EnsureShape, Equal,
- Exp, ExpandDims, Fill, Floor, FloorDiv, FloorMod, FusedBatchNormV3,
- GatherV2, Greater, GreaterEqual, Identity, IdentityN, If,
- ImageProjectiveTransformV3, LeakyRelu, Less, LessEqual, Log,
- LogicalAnd, LogicalNot, LogicalOr, LookupTableExportV2,
- LookupTableFindV2, LookupTableImportV2, MatMul, MatrixDiagV3,
- MatrixInverse, MatrixSolveLs, MatrixTriangularSolve, Max, MaxPool,
- Maximum, Mean, MergeV2Checkpoints, Min, Minimum, Mul,
- MutableDenseHashTableV2, Neg, NoOp, NonMaxSuppressionWithOverlaps,
- NotEqual, Pack, Pad, PadV2, PartitionedCall, Placeholder, Pow, Prod,
- RaggedRange, RaggedTensorFromVariant, RaggedTensorToTensor,
- RaggedTensorToVariant, Range, Rank, ReadVariableOp, RealDiv, Relu,
- Reshape, ResizeArea, ResizeBilinear, RestoreV2, ReverseV2,
- RngReadAndSkip, SaveV2, Select, SelectV2, Shape, ShardedFilename,
- Sigmoid, Sin, Size, Slice, Softplus, Split, SplitV, Sqrt, Square,
- Squeeze, StatefulPartitionedCall, StatelessIf,
- StatelessRandomUniformV2, StatelessWhile, StaticRegexFullMatch,
- StridedSlice, StringJoin, Sub, Sum, Tan, Tanh, TensorListConcatV2,
- TensorListFromTensor, TensorListGetItem, TensorListReserve,
- TensorListSetItem, TensorListStack, Tile, TopKV2, Transpose, Unpack,
- VarHandleOp, Where, While, ZerosLike
- ```
-
-- **Load:** `tf.saved_model.load` returned a `_UserObject` with
- `detect_poses` exposed. No warnings about deprecated kernels, no
- errors. The 11-minor-version forward jump from producer 2.10 to
- consumer 2.21 was a non-event, including the Keras 3 cutover at 2.16.
-- **Skeleton check:** `per_skeleton_joint_names["berkeley_mhad_43"]` had
- shape `(43,)` and `per_skeleton_joint_edges["berkeley_mhad_43"]` had
- shape `(42, 2)`, exactly matching what
- `tests/integration/test_estimator_smoke.py` asserts.
-- **End-to-end inference:** `model.detect_poses` on a black 288×384
- frame returned `{'poses3d': (0, 43, 3), 'boxes': (0, 5),
- 'poses2d': (0, 43, 2)}`, all `float32`. Zero detections is the
- correct output for a black frame — the important signal is that the
- shapes, dtypes, and key names exactly match what `FramePrediction` in
- `neuropose.io` is built to ingest, so the entire estimator pipeline
- is wire-compatible with this TF version.
-
-### Decision
-
-Pin `tensorflow>=2.16,<2.19`. Reasoning:
-
-1. **2.16 is the Apple Silicon floor that matters.** TF 2.16 is the
- first release with native `darwin/arm64` wheels published on PyPI
- under the `tensorflow` package name. Below 2.16, Mac users would
- need `tensorflow-macos` (a separate Apple-maintained package), which
- forces ugly platform markers in `pyproject.toml` and means Linux and
- Mac users run subtly different codebases. Above 2.16, the same
- single dependency line installs cleanly on every supported platform.
-2. **MeTRAbs imposes no upper bound below 3.0.** Producer 2.10 → consumer
- 2.21 (an 11-minor-version jump across the Keras 3 boundary) loaded
- and ran without a single complaint. The op inventory is 100% stock,
- so future TF 2.x releases would only break this if they removed
- stable kernels — which would itself be a TF 2.x SemVer violation.
-3. **`tensorflow-metal` is an opt-in extra, not a default.**
- `tensorflow-metal` is a PluggableDevice that Apple ships separately
- to add a Metal-backed `/GPU:0`. It has its own version-compatibility
- table (Apple maintains it at
- `developer.apple.com/metal/tensorflow-plugin/`), has a documented
- history of producing silently-wrong numerics on specific TF ops,
- and breaks intermittently on Keras 3. For a clinical-research
- pipeline where reproducibility matters more than inference latency,
- CPU inference on Mac is the right default. We do ship a
- `[project.optional-dependencies].metal` extra that pulls
- `tensorflow-metal>=1.2,<2` under darwin/arm64 platform markers, so
- users who want the speedup can opt in via
- `pip install 'neuropose[metal]'` — but the Metal path is not
- exercised in CI, is documented as experimental in
- `docs/getting-started.md`, and users are expected to spot-check
- `poses3d` output against the CPU path before trusting it for any
- clinical measurement.
-4. **`tensorflow-metal` forces a TF upper bound.** `tensorflow-metal`
- 1.2.0 (released January 2025, the latest version as of 2026-04) is
- advertised as supporting "TF 2.18+" but in practice fails on
- 2.19 and 2.20 with symbol-not-found errors and graph-execution
- `InvalidArgumentError`s. See
- [tensorflow/tensorflow#84167](https://github.com/tensorflow/tensorflow/issues/84167)
- and the Apple Developer forum threads at
- [developer.apple.com/forums/thread/772147](https://developer.apple.com/forums/thread/772147)
- and [developer.apple.com/forums/thread/803658](https://developer.apple.com/forums/thread/803658).
- 2.18.x is the last version confirmed to work cleanly on Apple
- Silicon GPU. Even though the Metal path is opt-in, dependency
- resolution is shared — if uv resolves `tensorflow` to 2.21 on a
- Linux developer's machine and 2.18 on the Mac, lockfile churn
- and "works on my box" become permanent. Cap is therefore applied
- globally rather than via a darwin/arm64 marker split. Cost on
- Linux is zero: nothing in the pipeline depends on TF 2.19+
- features, and the SavedModel ran fine on TF 2.21 in the probe
- above, so the cap is purely an external-package constraint. Lift
- it once Apple ships a Metal plugin that tracks mainline
- TensorFlow again.
-
-### What is **not** yet verified
-
-- The probe ran on Linux x86_64 only. macOS arm64 has not been exercised
- on real hardware. The argument that it should work is by construction
- — `tensorflow==2.16+` ships native arm64 macOS wheels, the SavedModel
- uses zero custom ops, and there is no MeTRAbs-side platform code — but
- empirical confirmation is still pending.
-- Linux arm64 has likewise not been exercised. Same by-construction
- argument applies.
-- A `macos-14` GitHub Actions matrix entry (which would run the unit
- tests on Apple Silicon hardware) is the cheapest way to catch any
- regression and is the intended follow-up.
-- Inference-output numerics have not been compared across platforms.
- This is the next layer of rigor below "does it run" — we expect
- fp32 results to match within ~1e-3 mm on `poses3d`, but a real
- cross-platform diff against a reference set has not been done.
-- The `[metal]` optional-dependencies extra exists in `pyproject.toml`
- but the Metal code path has never been exercised against the
- pinned MeTRAbs SavedModel. Enabling it is a pure opt-in and comes
- with a documented "verify your own numerics" caveat in
- `docs/getting-started.md`. Whether it actually produces a speedup
- on EfficientNetV2-L-based inference on real clinical videos —
- and whether that speedup is worth the numerical-divergence risk
- — is unknown.
-
-### Open questions
-
-1. Does the same `detect_poses` call produce numerically equivalent
- `poses3d` on macOS arm64 as on Linux x86_64 against a real (non-black)
- reference image? Within what tolerance?
-2. If a future MeTRAbs release introduces a custom op (e.g. for a new
- detector head), how do we want the loader to fail? Currently the
- `_REQUIRED_MODEL_ATTRS` interface check would still pass; the failure
- would surface at first `detect_poses` call, which is late.
-3. ~~Does it make sense to upper-bound the pin more tightly than `<3.0`
- (e.g. `<2.22` to bound to tested versions), or is the SemVer guard
- sufficient given the all-stock-ops result?~~ **Resolved 2026-04-16.**
- Tightened to `<2.19` for `tensorflow-metal` compatibility. See
- reasoning point 4 in the Decision section above.
-
-### Next steps
-
-- [ ] Run the same probe on real macOS arm64 hardware and log the
- result (load success, detect_poses success, output numerics
- diff against the Linux baseline).
-- [ ] Add a `macos-14` matrix entry to `.github/workflows/ci.yml` for
- the unit tests. Slow tests stay Linux-only to avoid doubling the
- MeTRAbs download cost in CI.
-- [ ] Re-run the probe whenever MeTRAbs upstream publishes a new model
- tarball, and diff the op inventory above. Any new op that is not
- in the list above is a flag worth investigating before raising
- the pin.
-- [ ] Benchmark `[metal]` vs CPU on a real Apple Silicon Mac against
- a short reference clip: measure (a) per-frame latency, (b) peak
- memory, and (c) `poses3d` divergence from the CPU baseline. If
- the speedup is meaningful and the numerics are within
- ~1e-2 mm, move the `metal` extra from "experimental" to
- "supported" in the docs. If not, document the failure mode
- here and keep the extra where it is.
-
----
-
-## MeTRAbs hosting and extensibility
-
-### Current state (v0.1, commit 11)
-
-The model loader in `neuropose._model.load_metrabs_model` will pin the
-canonical upstream URL:
-
-```
-https://omnomnom.vision.rwth-aachen.de/data/metrabs/metrabs_eff2l_y4_384px_800k_28ds.tar.gz
-```
-
-This is the RWTH Aachen "omnomnom" host — a raw HTTP file server run
-by the MeTRAbs authors' lab. There is no HuggingFace mirror of the
-relevant MeTRAbs variant as of commit 11.
-
-The URL encodes the model configuration:
-`metrabs_eff2l_y4_384px_800k_28ds` means the EfficientNetV2-L backbone,
-YOLOv4 detector head, 384-pixel input, 800k training steps, trained on
-28 datasets. This name pattern is worth preserving when we host the
-model ourselves so future variants stay self-describing.
-
-### Supply-chain concerns
-
-Pinning a single upstream URL to a third-party academic host is a
-real supply-chain risk, and the audit of the previous prototype called
-it out explicitly (the old code used `bit.ly/metrabs_1`, which was
-even worse). Concrete failure modes:
-
-- The RWTH Aachen host goes down or is decommissioned.
-- The URL changes when Sárándi releases a new MeTRAbs version.
-- The tarball contents change under the same URL without a version bump.
-
-**Minimum mitigation** (should land in or immediately after commit 11):
-
-- **Pin a SHA-256 checksum** alongside the URL, and verify on download
- before unpacking. If the checksum doesn't match, fail hard with a
- clear error.
-- **Cache aggressively.** Once downloaded and verified, never hit the
- network again for the same configuration. `model_cache_dir` is
- already in `Settings`.
-- **Document the exact filename and checksum** in `RESEARCH.md` (or
- migrate to a `MODEL_ARTIFACTS.md` file) so operators have a way to
- manually download the model out-of-band if the primary URL is dead.
-
-### Self-hosting options
-
-We want to host the model ourselves, both for reliability and because
-it opens the door to future fine-tuning and redistribution of our own
-variants. Candidate hosting approaches:
-
-#### Forgejo LFS
-
-Pros:
-- Lives next to the code.
-- Version-controlled artifacts.
-- Access control mirrors repo access.
-
-Cons:
-- LFS is designed for git-tracked binary assets, not for large
- infrequently-updated model weights — you pay LFS overhead on every
- clone unless you configure `lfs.fetchexclude`.
-- Model is ~2.2 GB; Forgejo LFS performance at that size is untested
- for our instance.
-- Pinning is by LFS pointer, which means the model is coupled to a
- particular repo revision. Messy if we want multiple code revisions
- to share the same model.
-
-**Verdict:** Workable but not the best fit.
-
-#### Forgejo generic package registry
-
-Forgejo supports a [generic package
-registry](https://forgejo.org/docs/latest/user/packages/generic/) that
-can host arbitrary binary artifacts with versioned URLs. This is
-closer to what we want:
-
-```
-https://git.levineuwirth.org/api/packages/neuwirth/generic/metrabs/eff2l_y4_384px_800k_28ds/metrabs.tar.gz
-```
-
-Pros:
-- Versioned URLs decoupled from repo revisions.
-- Upload once, download many times, no clone coupling.
-- Integrated auth if we want to gate access.
-- Can be made public even if the repo is private.
-
-Cons:
-- Requires uploading the file manually or via an API call.
-- Forgejo registry size / bandwidth limits depend on the instance.
-
-**Verdict:** Probably the best fit for "we want it hosted alongside
-the project."
-
-#### Plain HTTP server on a VPS subdomain
-
-A dedicated subdomain like `models.levineuwirth.org` backed by a
-simple HTTP file server (nginx `autoindex`, or Caddy with a tidy
-directory layout). Example URL:
-
-```
-https://models.levineuwirth.org/metrabs/metrabs_eff2l_y4_384px_800k_28ds.tar.gz
-```
-
-Pros:
-- Simplest possible story. No API, no auth machinery.
-- Easy to mirror from — anyone can curl the URL.
-- Decoupled from the git forge, so we can share models publicly even
- when the repo itself is private.
-- Easy to put a CDN in front (Cloudflare) if bandwidth ever matters.
-
-Cons:
-- Manual upload via scp/rsync.
-- No access control unless we add it.
-- No versioning beyond filename convention.
-
-**Verdict:** Strong candidate. This is probably the right choice for
-v0.1 of self-hosted models.
-
-#### S3-compatible object storage (MinIO self-hosted)
-
-Run MinIO on the VPS, get S3-compatible API for free, and serve models
-via pre-signed URLs or a public bucket.
-
-Pros:
-- Proper object storage with ETags, range requests, multipart uploads.
-- Integration story is straightforward if we ever move to cloud-hosted
- storage.
-- Industry-standard API.
-
-Cons:
-- More operational complexity than a plain HTTP server for what might
- be a handful of files.
-
-**Verdict:** Overkill for v0.1 but worth revisiting if model storage
-becomes a real operational concern.
-
-### Integrity: SHA-256 pinning
-
-Regardless of which hosting approach we pick, **the model loader should
-always verify a SHA-256 checksum** before trusting the downloaded
-artifact. This is the one piece of supply-chain hygiene that has to be
-in place before we ship commit 11 to any user outside the Shu lab.
-
-Implementation sketch for `neuropose/_model.py`:
-
-```python
-from pathlib import Path
-from typing import Any
-
-import tensorflow_hub as tfhub
-
-# _download, _verify_sha256, _extract_if_needed, and the _MODEL_* constants
-# are the module-level helpers this sketch assumes.
-def load_metrabs_model(cache_dir: Path | None = None) -> Any:
-    cache_dir = cache_dir or _default_model_cache_dir()
-    cache_dir.mkdir(parents=True, exist_ok=True)
-    tarball = cache_dir / _MODEL_FILENAME
-    if not tarball.exists():
-        _download(_MODEL_URL, tarball)
-    _verify_sha256(tarball, _MODEL_SHA256)
-    extracted = _extract_if_needed(tarball, cache_dir)
-    return tfhub.load(str(extracted))  # or tf.saved_model.load
-```
-
-The `_MODEL_SHA256` constant is the source of truth; if it ever has
-to change, the constant change is visible in the git diff and a human
-reviews it.
-
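-The verification helper itself is small; one plausible shape, chunked
-so the ~2.2 GB tarball never has to sit in memory:
-
-```python
-import hashlib
-from pathlib import Path
-
-def _verify_sha256(path: Path, expected: str) -> None:
-    digest = hashlib.sha256()
-    with path.open("rb") as f:
-        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
-            digest.update(chunk)
-    if digest.hexdigest() != expected:
-        raise RuntimeError(
-            f"SHA-256 mismatch for {path.name}: expected {expected}, "
-            f"got {digest.hexdigest()}"
-        )
-```
-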
-### Fine-tuning
-
-The next research direction after we have inference working is
-fine-tuning MeTRAbs on clinical-specific data. Open questions:
-
-- **What data?** Any clinical data is IRB-gated. Even de-identified
- pose data may carry subject information if the recording conditions
- (lighting, room layout) are distinctive enough. Any training plan
- has to run through the data-handling policy that lives (will live)
- in `docs/data-policy.md`.
-- **Transfer learning strategy.**
- - *Head-only fine-tuning*: freeze the EfficientNetV2-L backbone and
- re-train the pose regression head on clinical data. Fast, low
- compute, unlikely to overfit, but also unlikely to capture
- clinical-pose idiosyncrasies.
- - *Low-LR full fine-tune*: unfreeze everything, use a learning rate
- 1/100th of the original, train for a few epochs. Better
- adaptation, higher risk of catastrophic forgetting.
- - *Adapter layers*: insert small trainable adapters into the frozen
- backbone. Parameter-efficient, well-studied in NLP, less common
- for pose but should work.
-- **Compute requirements.** EfficientNetV2-L is roughly 120M parameters;
- fine-tuning on a single modern GPU (24 GB VRAM) is feasible at
- reduced batch size. A multi-GPU node is friendlier but not strictly
- required.
-- **Evaluation.** We need held-out clinical data with trusted ground
- truth. MoCap-derived poses are the gold standard; marker-based MoCap
- systems provide sub-millimeter accuracy at the cost of subject
- instrumentation. The Shu lab's access to MoCap is the gating factor.
-- **Sharing fine-tuned weights.** If we fine-tune on clinical data, the
- resulting weights may encode subject information in ways that are
- non-obvious and potentially IRB-relevant. Sharing fine-tuned weights
- externally has to be cleared through the same channels as sharing the
- training data.
-
-### Training our own pose estimator
-
-The long-range version of the research direction: train a pose
-estimator from scratch that extends MeTRAbs's methodology. MeTRAbs is
-a good starting point because the method is well-documented:
-
-- Sárándi, I., et al. (2020). "MeTRAbs: Metric-Scale Truncation-Robust
- Heatmaps for Absolute 3D Human Pose Estimation."
- [arXiv 2007.07227](https://arxiv.org/abs/2007.07227),
- IEEE Transactions on Biometrics, Behavior, and Identity Science.
-
-Core contributions (worth knowing if you modify any of this):
-
-- **Truncation-robust heatmaps.** Instead of predicting a 2D heatmap
- bounded by the image, MeTRAbs predicts a heatmap that extends
- *outside* the image and can place a joint at coordinates the image
- alone could not disambiguate. Critical for crops where the subject
- is partially out of frame.
-- **Metric scale regression.** MeTRAbs predicts the absolute 3D
- positions of joints in millimetres by combining a 2D heatmap with a
- per-joint depth regressor. Most 3D pose methods produce only
- relative coordinates, which are useless for clinical measurement.
-- **Multi-dataset training with a common skeleton.** The 28-dataset
- training set unifies disparate skeleton topologies into a common
- 43-joint Berkeley MHAD skeleton, which we carry forward in
- NeuroPose.
-
-**Natural extensions worth considering:**
-
-- **Temporal smoothing head.** MeTRAbs is a per-frame model. Clinical
- gait analysis wants temporally smooth trajectories. Adding a
- lightweight temporal head (1D CNN or small transformer over frame
- sequences) could produce smoother outputs without touching the
- backbone.
-- **Clinical-specific heatmap supervision.** If we have MoCap data for
- clinical poses, we can use it as ground-truth heatmap supervision to
- improve accuracy in the pose ranges the model sees least often in
-  the 28-dataset training corpus (e.g., pathological gaits,
-  walker-assisted ambulation).
-- **Multi-person identity tracking.** MeTRAbs produces detections per
- frame without continuity across frames. Adding a Hungarian-matched
- tracker (or a learned tracker) would solve the multi-person
- identity problem that `predictions_to_numpy` currently dodges with
- a `person_index` parameter.
-- **Alternative backbones.** EfficientNetV2-L is a 2020-era choice.
- Newer backbones (ConvNeXt, DINOv2-initialized ViTs) may give
- meaningful gains, especially for clinical poses that are
- under-represented in the original training set.
-- **Uncertainty estimation.** Clinical users want to know when the
- model is unsure. A Gaussian output head (mean + variance per joint)
- or an ensemble-based approach would let us propagate uncertainty
- into downstream analysis.
-
-**Compute requirements:** training MeTRAbs from scratch was reported
-as "a few weeks" on 8x V100 in the original paper. A from-scratch
-re-training is a substantial undertaking. Fine-tuning is much more
-accessible.
-
-### Collaboration opportunities
-
-- **István Sárándi** (now at University of Tübingen, formerly RWTH
- Aachen) is the author of MeTRAbs. The code is MIT-licensed and he
- has historically been responsive to collaboration requests. If we
- end up publishing work that significantly extends MeTRAbs, at the
- very least we should reach out about co-authorship or
- acknowledgment; at best we might find an active collaborator.
-- **The Shu Lab's existing collaborators** on clinical gait research
- at Brown and partner institutions may have MoCap-validated datasets
- we can use for fine-tuning and evaluation. Worth asking Dr. Shu.
-
-### Open questions
-
-1. Does Forgejo's generic package registry actually handle a 2.2 GB
- upload cleanly, or do we need the plain HTTP server route?
-2. What's the right SHA-256 pin to commit alongside the URL? (Need to
- download the tarball first and run `sha256sum`.)
-3. Do we have access to MoCap-validated clinical gait data for
- fine-tuning evaluation? This gates every training-related
- experiment.
-4. Is fine-tuning even worth pursuing before we have inference results
- that are clearly *not* good enough on clinical data? (I.e.,
- motivate the work with concrete failure cases rather than assuming
- a delta we haven't measured.)
-5. Does it make sense to reach out to Sárándi now, or wait until we
- have something concrete to collaborate on?
-
-### Reading list
-
-- Sárándi, I. et al. (2020). "MeTRAbs: Metric-Scale Truncation-Robust
- Heatmaps for Absolute 3D Human Pose Estimation."
- [arXiv 2007.07227](https://arxiv.org/abs/2007.07227). **Essential
- reading** for anyone planning to extend the method.
-- Sárándi's personal site and the MeTRAbs GitHub repo
-  (<https://github.com/isarandi/metrabs>) — the code, model zoo, and
-  training scripts live here.
-- Zheng, C. et al. (2023). "Deep Learning-Based Human Pose Estimation: A
- Survey." Good survey paper for orienting on the state of the art.
-- The original 28-dataset training composition referenced in the
- MeTRAbs paper — worth tracing through to understand what poses are
- in- and out-of-distribution for the pretrained model.
-
-### Next steps
-
-- [ ] Download the pinned tarball and compute its SHA-256 for the
- commit-11 model loader.
-- [ ] Decide between Forgejo generic registry and plain HTTP subdomain
- for self-hosting. Prototype whichever one wins.
-- [ ] Mirror the pinned tarball to the chosen self-hosted location so
-  we can fail over to it the moment the RWTH URL changes or goes
- down.
-- [ ] Write a one-page "MODEL_ARTIFACTS.md" that documents every model
- version we use, its checksum, and its canonical source URL.
-- [ ] Have the data-access conversation with Dr. Shu about clinical
- training data. Everything else is blocked on this.
-- [ ] (Much later) Reach out to Sárándi about potential collaboration
- once we have something concrete to talk about.
diff --git a/TECHNICAL.md b/TECHNICAL.md
deleted file mode 100644
index b48238b..0000000
--- a/TECHNICAL.md
+++ /dev/null
@@ -1,1191 +0,0 @@
-# NeuroPose Technical Ideation Notes
-
-A living engineering roadmap, parallel to `RESEARCH.md`. Where
-`RESEARCH.md` captures open methodological questions (DTW, skeleton
-choice, hosting the model), this document captures open *engineering*
-questions — release readiness, operability, scaling — and the paths
-they could take.
-
-This is **not** user-facing documentation. Items here are *candidates*
-for future work, and inclusion does not imply commitment.
-
-## How to use this document
-
-- Add a section when you start thinking about a new area of technical
- investment.
-- Each section should end with a **Scope**, **Sketch**, or **Open
- questions** block so it's obvious to a future you (or a new
- contributor) what the concrete next move would be.
-- When an item in here is decided and implemented, move it to the
- relevant place in `docs/` or in the code itself, and leave a short
- pointer behind (*See `docs/deployment.md` for the resolved design.*).
-- The audience is anyone maintaining the codebase — Levi, David,
- Praneeth, Dr. Shu, and whoever comes after us. Assume competence in
- Python and systems work; don't assume familiarity with our specific
- tooling choices.
-
-## Three phases, then a contingent track
-
-There are four distinct technical objectives, ordered by timeline and
-by what each enables next. The sequencing is deliberate: each phase
-unblocks the next, and doing them in any other order either publishes
-Paper C on top of a pipeline its own design notes disavow, or delays
-the open-source release past the window where the accompanying paper
-is still salient.
-
-1. **Phase 0 — C-enabling pipeline work.** A targeted subset of
- engineering work that has to land *before* Paper C can start. The
- DTW defaults shipped in 0.1 are explicitly a "mechanical port, not
- a methodological choice" (see `RESEARCH.md` §1); running the
- clinical validation study on them would mean publishing results
- from a pipeline the accompanying design notes explicitly criticize.
- Phase 0 fixes the analyzer's methodological foundations (Procrustes
- preprocessing, cycle segmentation, joint-angle DTW representation),
- locks in the reproducibility surface (`Provenance` subobject,
- YAML-configurable analysis pipeline), and sets up schema migration
- so data generated during Phase 1 survives the long write-up.
- **Near-term, well-scoped, weeks of work.**
-
-2. **Phase 1 — Paper C: clinical validation study.** The planned
- clinical-methods paper: cycle-aware joint-angle DTW for clinical
- gait similarity, validated against MoCap ground truth and/or
- clinician ratings. Gated on MoCap data access via Dr. Shu. This is
- research work, not engineering work — this document describes the
- engineering scaffolding *around* it, not the paper itself. Phase 2
- work can happen in the background during this phase as ideal filler
- for research-burnout cycles. **Months; timeline driven by data
- access and experimental design.**
-
-3. **Phase 2 — Coordinated open-source release + Paper A.** The
- engineering-paper companion (A) describing the tech stack, plus
- the tagged 0.1 release: PyPI publication, docs deployment, Docker
- images, CI matrix, supervision artifacts, doctor preflight, all
- the operational items that make the tool credible to external
- users. Timed to arrive *with or slightly before* Paper C's
- submission, producing a paper-plus-tool bundle that reviewers can
- actually run. **Weeks of work, timing driven by Paper C's
- submission window.**
-
-4. **Track 2 — Clinical platform (contingent).** Everything beyond
- the open-source research tool — multi-tenancy, audit logging,
- HTTP/API layer, clinician UI, clinical-system integrations. Not
- sequenced; activates only if specific triggers fire (external
- demand, multi-site ambition, funding mandate, publication
- traction). Most of this is background thinking, not planned work.
- The value of keeping it in this document is so that Phase 0 and
- Phase 2 decisions don't accidentally foreclose Track 2 options.
-
-Phases 0 → 1 → 2 form a near-term sequence that culminates in a
-paper-plus-release bundle. Track 2 sits outside that sequence and
-does not gate any of it.
-
-## Contents
-
-- [Phase 0: C-enabling pipeline work](#phase-0-c-enabling-pipeline-work)
- - [Procrustes preprocessing](#procrustes-preprocessing)
- - [Gait cycle segmentation](#gait-cycle-segmentation)
- - [Joint-angle DTW representation](#joint-angle-dtw-representation)
- - [Provenance subobject](#provenance-subobject)
- - [YAML-configurable analysis pipeline](#yaml-configurable-analysis-pipeline)
- - [Schema migration for VideoPredictions](#schema-migration-for-videopredictions)
-- [Phase 1: Clinical validation study (Paper C)](#phase-1-clinical-validation-study-paper-c)
-- [Phase 2: Coordinated open-source release + Paper A](#phase-2-coordinated-open-source-release--paper-a)
- - [Release definition](#release-definition)
- - [Apple Silicon CI matrix](#apple-silicon-ci-matrix)
- - [Mac hardware validation pass](#mac-hardware-validation-pass)
- - [Retention and pruning](#retention-and-pruning)
- - [neuropose doctor preflight](#neuropose-doctor-preflight)
- - [Process supervision artifacts](#process-supervision-artifacts)
- - [Structured logging option](#structured-logging-option)
- - [Monitor authentication](#monitor-authentication)
- - [Docker GPU image](#docker-gpu-image)
- - [Dependency freshness automation](#dependency-freshness-automation)
- - [Release workflow](#release-workflow)
- - [Error-path test coverage expansion](#error-path-test-coverage-expansion)
-- [Track 2: Clinical platform (contingent)](#track-2-clinical-platform-contingent)
- - [Triggers to activate Track 2](#triggers-to-activate-track-2)
- - [Multi-tenancy and identity](#multi-tenancy-and-identity)
- - [Audit logging and compliance posture](#audit-logging-and-compliance-posture)
- - [HTTP/API layer](#httpapi-layer)
- - [Clinician-facing UI](#clinician-facing-ui)
- - [Horizontal scaling](#horizontal-scaling)
- - [Backup, replication, and data durability](#backup-replication-and-data-durability)
- - [Clinical-system integrations](#clinical-system-integrations)
- - [Deterministic inference mode](#deterministic-inference-mode)
- - [Observability and SLOs](#observability-and-slos)
- - [Supply-chain attestation and signed releases](#supply-chain-attestation-and-signed-releases)
- - [Deployment orchestration](#deployment-orchestration)
-- [Decisions to not prematurely foreclose](#decisions-to-not-prematurely-foreclose)
-
----
-
-## Phase 0: C-enabling pipeline work
-
-The six items below are prerequisites for Paper C. Until they are
-landed, every analysis C would produce would be running on defaults
-that `RESEARCH.md` §1 explicitly flags as provisional. Ship these
-first, in any order that suits the implementer's cadence, and the
-rest of the project can pick up with confidence that Phase 1 results
-are trustworthy.
-
-### Procrustes preprocessing
-
-**Status:** Not implemented. `neuropose.analyzer.features` ships
-`extract_joint_angles` and feature-statistics helpers; no alignment
-step exists between pose sequences.
-
-**Why it matters for Paper C:** without alignment, DTW distance is
-translation- and orientation-dependent. Two recordings of the same
-subject from different camera positions produce different distances,
-which is almost never what a clinician wants. Paper C's methods
-section would need to apologize for this in print; cheaper to fix the
-method than to defend it.
-
-**Scope:**
-
-- Add `procrustes_align(a: np.ndarray, b: np.ndarray, *, mode:
- Literal["per_frame", "per_sequence"]) -> tuple[np.ndarray,
- np.ndarray, AlignmentDiagnostics]` to `neuropose.analyzer.features`.
- Implements the Kabsch algorithm (closed-form optimal rigid
- transform). Per-frame aligns each frame of A to the corresponding
- frame of B independently; per-sequence computes one transform over
- the whole sequence. Both are useful — per-frame for fine-grained
- matching, per-sequence for preserving within-trial dynamics.
-- Return aligned arrays plus an `AlignmentDiagnostics` dataclass with
- the fitted rotation magnitude and translation magnitude so
- downstream code can flag suspiciously large transforms (usually a
- sign of upstream annotation error).
-- Expose as an opt-in `align: Literal["none", "procrustes_per_frame",
- "procrustes_per_sequence"] = "none"` parameter on every DTW entry
- point in `neuropose.analyzer.dtw`. Default `none` preserves current
- behavior; Paper C's pipeline sets it to `procrustes_per_sequence`.
-- Unit tests: construct a known rotation + translation between two
- synthetic skeletons, verify alignment recovers it to within
- floating-point precision; verify alignment of a sequence with its
- own translated copy produces zero residual.
-
-**Non-scope:**
-
-- Non-rigid alignment (thin-plate splines, learned registration). Not
- needed for skeleton-level comparison and would be a research
- contribution on its own.
-
-**Open question:** should alignment also include optional scaling
-(scaled-Procrustes / full Procrustes)? For cross-subject comparison
-it almost certainly should. Default to scale-preserving and add a
-`scale: bool = False` flag; Paper C can flip it on for cross-subject
-figures.
-
-### Gait cycle segmentation
-
-**Status:** `segment_by_peaks` in `neuropose.analyzer.segment`
-performs generic valley-to-valley segmentation on a supplied 1D
-signal. There is no gait-specific wrapper that knows to look at the
-heel's vertical coordinate.
-
-**Why it matters for Paper C:** clinical gait analysis wants to
-compare *the 4th heel-strike of trial A* to *the 4th heel-strike of
-trial B*, not *frame 120 of A vs frame 120 of B*. Per-cycle DTW is
-the standard approach in the biomechanics literature (Sadeghi et al.
-2000 and descendants); running full-trial DTW on gait is a choice
-reviewers of Paper C would correctly flag as methodologically weak.
-
-**Scope:**
-
-- New `segment_gait_cycles(predictions: VideoPredictions, *, joint:
- str = "rhee", axis: Literal["x", "y", "z"] = "y", min_cycle_seconds:
- float = 0.4) -> Segmentation` in `neuropose.analyzer.segment`.
-- Under the hood: extract the specified joint's coordinate along the
- specified axis, apply `segment_by_peaks` with appropriate distance
- and prominence thresholds (derived from `min_cycle_seconds` via
- `predictions.metadata.fps`), return the resulting `Segmentation`
- (the existing `neuropose.io.Segmentation` type) so downstream
- tooling picks it up unchanged.
-- Two-sided detection: run the same detection on the opposite heel
- and return *both* per-side segmentations under named keys
- (`left_heel_strikes`, `right_heel_strikes`). Clinical users will
- want both.
-- Allow the reference joint and axis to be configurable so trials
- recorded with a different camera orientation (lateral vs frontal
- vs oblique) can still be segmented without a code change.
-
-**Non-scope:**
-
-- HMM-based cycle detection, learned cycle detectors. Peak detection
- on vertical coordinate is standard, well-understood, and the
- method the biomechanics literature expects to see.
-- Handling pathological gaits where heel-strikes are absent
- (shuffling, walker-assisted). The function should degrade
- gracefully (return a `Segmentation` with an empty list, not raise),
- and Paper C's data-quality filtering handles the rest.
-
-**Open question:** should the function also emit a "confidence" per
-cycle (prominence of the detected peak, regularity of spacing) that
-Paper C can use to filter out low-quality detections? Cheap to add,
-useful downstream. Recommend yes.
-
-### Joint-angle DTW representation
-
-**Status:** `dtw_all`, `dtw_per_joint`, and `dtw_relation` operate on
-raw 3D coordinates or joint-pair displacements. `extract_joint_angles`
-produces per-frame angle sequences but is not wired as a DTW input.
-
-**Why it matters for Paper C:** angle-space DTW is translation- and
-rotation-invariant by construction, scale-invariant if normalized,
-and directly interpretable in clinical terms ("knee flexion angle
-during swing phase"). Paper C's headline figures almost certainly
-use angle-space distances; raw coordinates would draw the obvious
-reviewer question of why we aren't comparing the thing clinicians
-actually measure.
-
-**Scope:**
-
-- Add `representation: Literal["coords", "angles", "relation"] =
- "coords"` to every DTW entry point. The `coords` default preserves
- existing behavior; `angles` runs `extract_joint_angles` on each
- input first; `relation` is the existing `dtw_relation` path
- expressed as a representation choice rather than a separate
- function (leaving the `dtw_relation` name as a convenience wrapper
- if preferred).
-- Degenerate-vector handling: `extract_joint_angles` returns NaN for
- degenerate (zero-length) vectors. The DTW path needs to decide how
- to handle NaN — skip-and-interpolate, drop, or propagate to the
- distance. Propagation is safest (makes the problem visible);
- interpolation is what clinical users probably want day-to-day.
- Default to propagation and expose `nan_policy: Literal["propagate",
- "interpolate", "drop"]` for experimentation.
-- Tests: synthetic pair with known angular difference, assert DTW in
- angle-space recovers it independent of global rotation applied to
- the input.
-
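-For reference, the `interpolate` policy amounts to per-joint linear
-interpolation over time. A sketch, assuming an `(frames, angles)` array
-(hypothetical helper, not the final API):
-
-```python
-import numpy as np
-
-def interpolate_nans(angles: np.ndarray) -> np.ndarray:
-    """Fill NaN angle entries by linear interpolation along the time axis."""
-    out = angles.copy()
-    t = np.arange(len(out))
-    for j in range(out.shape[1]):
-        col = out[:, j]
-        bad = np.isnan(col)
-        if bad.any() and not bad.all():
-            col[bad] = np.interp(t[bad], t[~bad], col[~bad])
-    return out
-```
-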
-**Non-scope:**
-
-- Quaternion or SO(3) rotation-space DTW. Interesting but requires a
- rotation parameterization the current skeleton output does not
- produce.
-- Mixed-representation (position + angle concatenated, learned
- embeddings). These are experiments Paper C might run; they don't
- belong in Phase 0 infrastructure.
-
-### Provenance subobject
-
-**Status:** `PerformanceMetrics` captures `tensorflow_version`,
-`active_device`, and `tensorflow_metal_active`. Model SHA is not
-computed or propagated. `numpy_version` and `neuropose_version` are
-not recorded. No first-class `Provenance` object.
-
-**Why it matters for Paper C:** reproducibility is the first
-question a reviewer asks of a clinical-methods paper. The answer
-needs to be "same model artifact, same pipeline config, same
-versions, same seeds" — and all four need to be recorded on every
-`results.json` that underlies a paper figure. Not having this means
-either manually tracking it in a lab notebook (fragile, won't
-survive personnel turnover) or running every experiment through a
-pinned Docker image (expensive, doesn't capture runtime
-non-determinism). The subobject is the cheap right answer.
-
-**Scope:**
-
-- New `Provenance` pydantic model in `neuropose.io` with fields:
- `model_sha256: str`, `model_filename: str`, `tensorflow_version:
- str`, `tensorflow_metal_version: str | None`,
- `numpy_version: str`, `neuropose_version: str`, `python_version:
- str`, `seed: int | None`, `deterministic: bool`, `analysis_config:
- dict | None` (the YAML of the run if the pipeline was invoked via
- `neuropose analyze --config`).
-- Optional `provenance: Provenance | None = None` field on
- `VideoPredictions`, `JobResults`, and `BenchmarkResult`. None-valued
- on legacy files (enabled by schema migration — see below), populated
- on every new write.
-- `_model.py` hashes the downloaded tarball on first load (after the
- existing SHA verification — the two checks use the same hash so
- compute is amortized) and exposes the hash via a
- `get_model_sha256()` method on the `Estimator`. `Interfacer._run_job_inner`
- constructs the `Provenance` and attaches it to the output.
-- Unit test: serialize → JSON → deserialize round-trip identity;
- assert `model_sha256` matches the SHA recorded in
- `neuropose._model`.
-
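-A minimal sketch of the model as specified above, assuming pydantic v2
-(defaults here are assumptions):
-
-```python
-from pydantic import BaseModel, ConfigDict
-
-class Provenance(BaseModel):
-    # "model_"-prefixed fields collide with pydantic's protected namespace;
-    # clearing it keeps the field names from the spec above warning-free.
-    model_config = ConfigDict(protected_namespaces=())
-
-    model_sha256: str
-    model_filename: str
-    tensorflow_version: str
-    tensorflow_metal_version: str | None = None
-    numpy_version: str
-    neuropose_version: str
-    python_version: str
-    seed: int | None = None
-    deterministic: bool = False
-    analysis_config: dict | None = None
-```
-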
-**Non-scope:**
-
-- Cryptographic signatures on results.json. That's Phase 2 (sigstore
- on release artifacts) or Track 2 (per-output signing) territory,
- not Phase 0.
-- Provenance on arbitrary intermediate products (numpy arrays, DTW
- distance matrices). Top-level JSONs cover Paper C's needs; richer
- intermediates can inherit from a hand-off if needed.
-
-**Open question:** does Paper C need *per-frame* provenance (which
-frame was processed with which configuration) or just per-job
-provenance? Per-job is enough for reproducibility; per-frame is only
-useful if we want to mix configurations within a single job, which
-has no current use case.
-
-### YAML-configurable analysis pipeline
-
-**Status:** `neuropose.cli`'s `analyze` subcommand is a stub that
-raises `NotImplementedError`. Analysis operations are called
-individually from Python, or via CLI flags on `segment` and
-`benchmark`. No unified representation of "a complete analysis run."
-
-**Why it matters for Paper C:** the paper will run many experimental
-configurations — alignment on/off, per-frame vs per-sequence, raw
-coordinates vs joint angles, full-trial vs cycle-segmented DTW,
-various distance metrics. Each experiment should be reproducible
-from a single file that can be version-controlled, diffed, attached
-to the `Provenance` object, and cited in the paper. A Python script
-full of kwargs is the alternative, and it's exactly the alternative
-the open-source community collectively decided against ten years ago.
-
-This item also resolves the "`neuropose analyze`: ship or remove"
-question that was previously open: we are shipping `analyze`, just
-specifically in a YAML-driven form. The stub that currently exists
-becomes the real command in Phase 0.
-
-**Scope:**
-
-- `AnalysisConfig` pydantic model in `neuropose.analyzer` capturing
- the full pipeline: input source (predictions file path),
- preprocessing (`align`, `normalize`, `segment`), per-segment
- analysis (DTW backend, representation, distance function, extra
- kwargs), output (figures, statistics, distance matrices).
-- Parseable from YAML via pydantic; validated on parse so typos in
- field names fail early with a clear error.
-- `neuropose analyze --config experiment.yaml [--output
- results_042.json]` runs the pipeline end-to-end. The config YAML
- is serialized into the resulting `Provenance.analysis_config`, so
- the output file is self-describing.
-- Ship three or four *example* configs under `examples/analysis/`
- that exercise the full matrix of alignment × representation ×
- segmentation choices Paper C will care about. Double as integration
- tests.
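-
-To make the shape concrete, a sketch under stated assumptions
-(pydantic v2 plus PyYAML; sub-model names and defaults are
-illustrative placeholders, not the settled schema):
-
-```python
-from pathlib import Path
-
-import yaml
-from pydantic import BaseModel, ConfigDict
-
-
-class Preprocessing(BaseModel):
-    model_config = ConfigDict(extra="forbid")  # typos fail at parse time
-    align: bool = False
-    normalize: bool = False
-    segment: bool = False
-
-
-class DTWStep(BaseModel):
-    model_config = ConfigDict(extra="forbid")
-    backend: str = "fastdtw"
-    representation: str = "raw"  # or "joint_angles"
-    distance: str = "euclidean"
-    kwargs: dict = {}
-
-
-class AnalysisConfig(BaseModel):
-    model_config = ConfigDict(extra="forbid")
-    predictions: Path  # input predictions file
-    preprocessing: Preprocessing = Preprocessing()
-    analysis: DTWStep = DTWStep()
-    outputs: list[str] = ["statistics"]
-
-    @classmethod
-    def from_yaml(cls, path: Path) -> "AnalysisConfig":
-        return cls.model_validate(yaml.safe_load(path.read_text()))
-```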
-
-**Non-scope:**
-
-- A DAG / workflow engine (Snakemake, Nextflow). A flat config is
- enough for Paper C's needs; reach for a DAG tool only when
- experiments have genuine inter-stage dependencies, which analysis
- of a single video does not.
-- Parallel sweep execution. Run multiple configs via a shell loop
- for now (`for cfg in examples/analysis/*.yaml; do neuropose
- analyze --config "$cfg" --output "out/$(basename "$cfg" .yaml).json"; done`).
- A real sweep orchestrator is Track 2.
-
-**Open question:** should there be a `neuropose analyze compare
-<config_a> <config_b>` subcommand that runs both configs and
-emits a diff figure? Useful for Paper C but not a gating feature —
-post-Phase-0 addition if the need is clear.
-
-### Schema migration for VideoPredictions
-
-**Status:** `VideoPredictions` gained `segmentations: dict[str,
-Segmentation] = Field(default_factory=dict)` during recent work. Old
-JSON files without the field still load (pydantic default-factories
-back-fill), but this is accidental rather than designed-in.
-
-**Why it matters for Paper C:** Paper C will produce analysis results
-over the course of 6-12 months. During that window, Phase 0 work
-itself will evolve — the `Provenance` object will gain fields, the
-`AnalysisConfig` shape will stabilize, maybe the `Segmentation` schema
-will extend. Without migration support, every schema change would
-invalidate some portion of Paper C's already-generated data, forcing
-either a freeze (drops velocity) or a full re-run (wastes compute).
-Migration now is the cheap fix.
-
-**Scope:**
-
-- Add a `schema_version: int = 1` field to `VideoPredictions`,
- `JobResults`, and `BenchmarkResult` (the three load-anywhere
- top-level schemas).
-- Write `migrate_video_predictions(payload: dict) -> dict` that
- takes a raw JSON-loaded dict, dispatches on `schema_version`, and
- returns a dict conformant with the current version. Default to 1
- when missing (existing files).
-- Wire it into `load_video_predictions()` so the migration runs
- before pydantic validation. Log at INFO on migration so users see
- when files are being upgraded.
-- When writing, always write the current version.
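-
-A minimal sketch of the dispatch shape (the v2 step is hypothetical;
-today only v1 exists):
-
-```python
-import logging
-
-logger = logging.getLogger(__name__)
-
-CURRENT_SCHEMA_VERSION = 2  # hypothetical: pretend v2 added `provenance`
-
-
-def _v1_to_v2(payload: dict) -> dict:
-    payload["provenance"] = None  # field introduced in v2
-    payload["schema_version"] = 2
-    return payload
-
-
-_MIGRATIONS = {1: _v1_to_v2}
-
-
-def migrate_video_predictions(payload: dict) -> dict:
-    """Upgrade a raw JSON-loaded dict to the current schema version."""
-    version = payload.get("schema_version", 1)  # missing means oldest
-    while version < CURRENT_SCHEMA_VERSION:
-        logger.info("Migrating schema v%d -> v%d", version, version + 1)
-        payload = _MIGRATIONS[version](payload)
-        version = payload["schema_version"]
-    return payload
-```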
-
-**Non-scope:**
-
-- A general-purpose migration framework. A function that dispatches
- on an integer is sufficient until we have three versions.
-- In-place migration (writing back the upgraded file). Migrations
- should run on read; write-back is a separate operator decision.
-
----
-
-## Phase 1: Clinical validation study (Paper C)
-
-Phase 1 is *Paper C itself* — the clinical-methods paper this project
-exists to produce. The content belongs in the paper, in `RESEARCH.md`,
-and in the analysis-config YAMLs under `examples/`, not here. This
-section exists only to demarcate the phase and to capture the
-engineering commitments that should (and should not) happen during it.
-
-**Engineering posture during Phase 1:**
-
-- **Phase 0 is frozen on entry.** Don't refactor the analyzer during
- Phase 1; refactors invalidate earlier experiments. If a Phase 0
- shortcoming surfaces during paper-writing, log it in `RESEARCH.md`
- and revisit after submission.
-- **Phase 2 work is welcome as background.** Writing a launchd plist,
- wiring up Dependabot, tightening error-path tests — all of this is
- ideal filler work during the experimental-design and writing
- phases of Paper C. It consumes different energy than research work
- does, and the tool is in better shape on submission day as a
- result.
-- **`RESEARCH.md` gets the bulk of the updates.** Methods decisions,
- reading-list expansions, reviewer-response notes all live there,
- not here.
-- **Do add engineering-side notes here** when a Paper C experiment
- reveals a piece of missing tooling that's worth a Phase 2 item
- (for example: "we needed batch-analysis across 200 trials and hit
- this, so Phase 2 should include ..."). Phase 1 is the best
- possible source of prioritization signal for what Phase 2 is
- actually worth.
-
-**Prerequisite outside this document:** a MoCap-data-access
-conversation with Dr. Shu. Nothing in Phase 1 can start until that
-conversation has resolved. `RESEARCH.md` §3 flags this as the
-gating question for fine-tuning; it is equally the gating question
-for validation.
-
----
-
-## Phase 2: Coordinated open-source release + Paper A
-
-Phase 2 is the release. Its content is exactly the items listed here
-— the engineering work to take the Phase-0-plus-Phase-1 codebase to a
-state where an outside researcher can pick it up, install it, run it,
-verify its claims, and cite it. It runs concurrently with the tail
-end of Phase 1 (see posture notes above) and culminates in a
-coordinated drop: tag → PyPI → Pages → arXiv / JOSS submission for
-Paper A → reference in Paper C's Code Availability section.
-
-### Release definition
-
-Before enumerating the remaining work, define what "released" means.
-A release candidate should satisfy all of the following:
-
-1. **Installable on a blank machine.** `pip install neuropose` or
- `uv pip install neuropose` works on both Linux x86_64 and Apple
- Silicon Mac, with no manual steps beyond Python 3.11.
-2. **Runnable without the author in the room.** The `docs/` site is
- published somewhere persistent (GitHub Pages, Cloudflare Pages),
- the getting-started walkthrough actually works end-to-end, and
- the MeTRAbs model downloads and verifies on first run.
-3. **Verifiable by a reviewer.** CI runs on every push, covers both
- Linux and macOS, and a PR from a stranger could be meaningfully
- reviewed without access to the research Mac.
-4. **Honest about its limits.** Every surface the release advertises
- is either exercised in CI or clearly marked experimental. No
- false promises in the README or CLI help text. (The `analyze`
- stub that motivated this item pre-Phase-0 is now real per Phase
- 0's YAML pipeline, so "ship or remove" is no longer open.)
-5. **Versioned.** A git tag exists, `__version__` matches, and
- `CHANGELOG.md` has a real release section, not just `[Unreleased]`.
-6. **Bundled.** Paper A (tech-stack writeup) and Paper C (clinical
- validation) cite the release tag, and the release notes cite
- them. The three artifacts arrive together; reviewers of either
- paper can find and run the code.
-
-Items below are the gaps between the end-of-Phase-0 state and that
-definition.
-
-### Apple Silicon CI matrix
-
-**Status:** `RESEARCH.md` lists this as an open next step; no
-`macos-14` entry in `.github/workflows/ci.yml`.
-
-**Why it matters for release:** every claim of "Apple Silicon
-support" is currently "by construction" — the TF 2.16+ floor ships
-`darwin/arm64` wheels, the MeTRAbs SavedModel has zero custom ops, and
-therefore it should work. It has not been empirically confirmed on
-real hardware in an automated way. For a public release, we either
-verify in CI or we stop claiming Mac support in the README.
-
-**Scope:**
-
-- Add a `macos-14` matrix entry to the `test` job (lint and typecheck
- stay single-platform since they're platform-independent).
-- Exclude `slow` markers on macOS so we don't pay the 2 GB model
- download twice per run.
-- Accept that the first green macOS run may require two or three
- hotfixes — path case sensitivity, `multiprocessing` spawn vs fork,
- shared library load order — and budget a day for that.
-- Do **not** add a Metal runner. GitHub's `macos-14` runners don't
- expose the GPU to TensorFlow in a useful way, and the `[metal]`
- extra's numerical verification is a separate task that needs real
- M-series silicon we control.
-
-**Sketch:**
-
-```yaml
-test:
-  strategy:
-    fail-fast: false
-    matrix:
-      os: [ubuntu-latest, macos-14]
-  runs-on: ${{ matrix.os }}
-```
-
-Everything else in the job stays the same; `uv` works identically on
-both platforms.
-
-### Mac hardware validation pass
-
-**Status:** Unexercised. The Shu Lab research Mac (`100.64.15.110`) is
-available; we have an rsync script but no cron job, no automated
-smoke check, no numerical-divergence report against the Linux
-baseline.
-
-**Why it matters for release:** CI on GitHub's `macos-14` runners
-validates that the wheels install and the tests pass on Apple
-Silicon. It does not validate that the real MeTRAbs model loads, that
-inference runs, or that `poses3d` on the Mac matches `poses3d` on
-Linux within a sane tolerance. Those are different questions, and
-answering them against a throwaway runner each time would be wasteful
-and unreliable.
-
-A minimum version of this check — "does `detect_poses` produce
-output on the research Mac at all?" — should happen during Phase 0
-regardless, because Paper C will likely run on the same hardware and
-a silent numerical divergence there would invalidate the paper's
-results. The scope below is the full, release-grade version.
-
-**Scope:**
-
-- Run `neuropose benchmark --compare-cpu` against a reference clip on
- the research Mac. Capture the resulting `BenchmarkResult` JSON.
-- Commit the JSON as `benchmarks/reference/mac_m3_ultra_cpu_v0_1.json`
- (a tracked file, not gitignored — this is the reference numerics
- we'll compare against going forward).
-- Separately, run the `[metal]` path and diff. Record in
- `RESEARCH.md` whether divergence is within the ~1e-2 mm budget the
- research notes propose, or whether the Metal path is in the "use at
- your own risk" column.
-- Document the findings as a new section in `RESEARCH.md` ("Apple
- Silicon verification, 2026-0X") and close the corresponding
- open-question entry.
-
-**Open question:** should the reference JSON become a test input
-(slow-marked integration test that re-runs benchmark on a developer's
-machine and asserts divergence from the committed reference), or just
-documentation? The former catches regressions automatically at the
-cost of a 2 GB model download in the slow job; the latter is cheaper
-but easier to ignore.
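-
-If the test-input route wins, a minimal sketch of the slow-marked
-check (the `run_reference_benchmark` helper and the `poses3d` key
-are hypothetical names; the tolerance is the ~1e-2 mm budget above):
-
-```python
-import json
-from pathlib import Path
-
-import numpy as np
-import pytest
-
-REFERENCE = Path("benchmarks/reference/mac_m3_ultra_cpu_v0_1.json")
-
-
-@pytest.mark.slow
-def test_benchmark_matches_committed_reference():
-    # Hypothetical helper: re-runs the benchmark on the reference clip
-    # and returns the predicted 3D poses as an array.
-    from neuropose.benchmark import run_reference_benchmark
-
-    current = np.asarray(run_reference_benchmark())
-    reference = np.asarray(json.loads(REFERENCE.read_text())["poses3d"])
-    np.testing.assert_allclose(current, reference, atol=1e-2)
-```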
-
-### Retention and pruning
-
-**Status:** `out/` and `failed/` grow forever. No retention config.
-No `neuropose prune` command.
-
-**Why it matters for release:** a research Mac running the daemon
-unattended for months will fill its disk. The first support request
-will be "the daemon just stopped working" and the answer will be "you
-ran out of disk." We can solve this once now, or a hundred times
-later.
-
-**Scope:**
-
-- Add a `retention_days: int | None = None` setting (default None =
- disabled, preserving current behavior).
-- When set, the daemon checks on each poll whether any job in
- `out/` or `failed/` is older than the threshold and removes it. The
- corresponding `status.json` entry transitions to a new `PRUNED`
- state (keeping the audit trail) or is removed entirely (keeping the
- status file small) — pick one and document.
-- Ship a `neuropose prune [--older-than N] [--dry-run]` one-shot
- command for operators who want manual control.
-- Document in `docs/deployment.md` with a recommended default (30
- days feels right for benchmark/iteration workflows; clinical
- deployments would be legal-driven and much longer).
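-
-A minimal sketch of the pruning pass (directory mtime as the age
-proxy is an assumption; tombstone handling, per the open question
-below, is left to the caller):
-
-```python
-import shutil
-import time
-from pathlib import Path
-
-
-def prune_old_jobs(data_dir: Path, retention_days: int,
-                   dry_run: bool = False) -> list[str]:
-    """Remove out/ and failed/ job directories older than the threshold."""
-    cutoff = time.time() - retention_days * 86400
-    pruned: list[str] = []
-    for bucket in ("out", "failed"):
-        for job_dir in sorted((data_dir / bucket).iterdir()):
-            if job_dir.is_dir() and job_dir.stat().st_mtime < cutoff:
-                if not dry_run:
-                    shutil.rmtree(job_dir)
-                pruned.append(job_dir.name)
-    return pruned
-```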
-
-**Open question:** should pruned jobs' `status.json` entries be
-preserved as tombstones (so a user asking "where did job X go?" can
-see "pruned 2026-05-01") or removed entirely? Tombstones are more
-user-friendly; removal keeps the status file bounded. Default to
-tombstones since the status file bound is only a problem at a scale
-the 0.1 release won't hit.
-
-### neuropose doctor preflight
-
-**Status:** Not implemented.
-
-**Why it matters for release:** pydantic-settings validates the
-*schema* of `Settings` (is `device` a valid string, is
-`poll_interval_seconds` positive). It does not validate the
-*environment* — is `data_dir` writable, is the lock file acquirable,
-is `model_cache_dir` on the same filesystem as `data_dir` (so
-`os.rename` works atomically), is the configured TF device actually
-available. Each of those is a runtime failure mode that shows up with
-an ugly traceback ten seconds after `neuropose watch` starts, and
-every one is cheaply detectable at startup.
-
-**Scope:**
-
-- New subcommand `neuropose doctor` that runs a battery of
- preflight checks and prints a pass/fail table.
-- Checks to include: `data_dir` exists and is writable; lock file
- acquirable (with clean release); all three subdirectories
- (`in/out/failed`) writable; `model_cache_dir` writable and on the
- same filesystem as `data_dir`; TF is importable; configured
- `device` is in `tf.config.list_physical_devices()`;
- `tensorflow-metal` either absent or installed with a version that
- advertises support for the installed TF; XDG envvars are sane;
- Python version matches `pyproject.toml` floor.
-- Exit code 0 if all checks pass, 1 if any warning, 2 if any fatal
- failure.
-- The daemon's `run()` entry point calls the same underlying
- preflight function before entering the poll loop, so
- `watch`-without-doctor still gets the benefit.
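-
-Two of the cheaper checks, sketched with an assumed
-`(name, ok, detail)` result shape:
-
-```python
-import tempfile
-from pathlib import Path
-
-
-def check_data_dir_writable(data_dir: Path) -> tuple[str, bool, str]:
-    try:
-        with tempfile.NamedTemporaryFile(dir=data_dir):
-            pass  # created and deleted: the directory is writable
-        return ("data_dir writable", True, str(data_dir))
-    except OSError as exc:
-        return ("data_dir writable", False, str(exc))
-
-
-def check_same_filesystem(data_dir: Path, cache_dir: Path) -> tuple[str, bool, str]:
-    # os.rename is only atomic within one filesystem, so compare device IDs.
-    same = data_dir.stat().st_dev == cache_dir.stat().st_dev
-    return ("model_cache_dir on data_dir filesystem", same,
-            f"{data_dir} vs {cache_dir}")
-```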
-
-**Non-scope:**
-
-- Do not check for network access to the MeTRAbs download host.
- Network-dependent checks make CI flaky and don't match the offline
- caching behavior of real operators.
-
-### Process supervision artifacts
-
-**Status:** `docs/deployment.md` documents a systemd user unit as
-text in prose. No file in `scripts/` that a user can actually copy.
-No macOS launchd plist at all.
-
-**Why it matters for release:** copy-paste from a docs page into a
-`.service` file works, but it's friction. An open-source project with
-"here is the file, here is where it goes, here is the enable command"
-ships deployments faster.
-
-**Scope:**
-
-- Ship `scripts/systemd/neuropose.service` as a file with `%h`
- placeholders and a short install README.
-- Ship `scripts/launchd/org.levineuwirth.neuropose.plist` as a file
- with an install README. (Consider making the plist label match the
- reverse-DNS of whoever is hosting — either the lab's or
- `org.neuropose.daemon` for a vendor-neutral identity.)
-- Optional: a `scripts/install_service.sh` that detects the platform
- and runs the right install command. Probably not worth the
- complexity; a five-line README section per platform is fine.
-
-**Non-scope:**
-
-- Do not write installers for init systems we do not personally run
- (upstart, sysvinit, runit). If someone needs those, the systemd
- unit gives them enough of a template.
-
-### Structured logging option
-
-**Status:** Everything logs to stderr via `logging.basicConfig`
-with a human-readable formatter.
-
-**Why it matters for release:** the current format is correct for
-interactive use. For any consumer that wants to feed the daemon's
-output into Loki, Splunk, Grafana, Datadog, or even `jq`-based
-aggregation, JSON-per-line would eliminate a parsing step. This is
-a near-free feature if added now and a disruptive formatting change
-if added later. It is also a prerequisite for any Track 2
-audit-logging work, so building it now keeps Track 2 options open at
-near-zero cost.
-
-**Scope:**
-
-- Add a `--log-format={human,json}` global CLI option defaulting to
- `human`.
-- Implement the `json` variant as a formatter that emits
- `{"ts": ..., "level": ..., "logger": ..., "message": ..., ...}` per
- line with no log-line wrapping.
-- Wire it through `_configure_logging()` so every subcommand benefits
- identically.
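-
-A minimal sketch of the `json` formatter using only the stdlib:
-
-```python
-import json
-import logging
-
-
-class JsonLineFormatter(logging.Formatter):
-    """One JSON object per record; stack traces inlined, never wrapped."""
-
-    def format(self, record: logging.LogRecord) -> str:
-        payload = {
-            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
-            "level": record.levelname,
-            "logger": record.name,
-            "message": record.getMessage(),
-        }
-        if record.exc_info:
-            payload["exc_info"] = self.formatException(record.exc_info)
-        return json.dumps(payload)
-```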
-
-**Open question:** do we also want log correlation IDs per job?
-That's a bigger change (pushing a context var through the
-Interfacer's call stack) and probably Track 2 — skip for 0.1.
-
-### Monitor authentication
-
-**Status:** The monitor binds to `127.0.0.1:8765` by default. No
-auth, no tokens. `--host 0.0.0.0` works but has a comment warning the
-operator to think.
-
-**Why it matters for release:** loopback-only is a reasonable
-default, but the monitor is specifically marketed as the thing
-collaborators can watch. "Collaborator" implies a browser somewhere
-other than the daemon host. The "correct" answer (TLS, real auth) is
-too expensive for 0.1; the "wrong but acceptable" answer (no auth, so
-anyone who can reach the port sees everything) is what we have now.
-There's a middle ground.
-
-**Scope:**
-
-- Add an optional `monitor_token: str | None = None` setting.
-- When set, every request to `/` and `/status.json` must carry
- `?token=<token>` in the query string or the token in an
- `X-Status-Token` header. No token → 401.
-- `neuropose serve` prints a URL including the token on startup, so
- operators can copy-paste it. If `monitor_token` is unset, behavior
- is unchanged.
-- `--host 0.0.0.0` emits a stderr warning if `monitor_token` is unset
- — don't block it, just flag it.
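-
-The check itself is tiny; a framework-agnostic sketch (handler
-plumbing omitted):
-
-```python
-import hmac
-from urllib.parse import parse_qs, urlparse
-
-
-def is_authorized(path: str, headers: dict[str, str],
-                  monitor_token: str | None) -> bool:
-    """Shared check for / and /status.json; a None token disables auth."""
-    if monitor_token is None:
-        return True  # unset setting preserves today's open behavior
-    query_token = parse_qs(urlparse(path).query).get("token", [""])[0]
-    supplied = query_token or headers.get("X-Status-Token", "")
-    # Constant-time comparison to avoid leaking the token via timing.
-    return hmac.compare_digest(supplied, monitor_token)
-```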
-
-**Non-scope:**
-
-- TLS. Use a reverse proxy (Caddy, nginx, `ssh -L`) for any
- internet-facing exposure. The monitor is not the right place to
- terminate TLS.
-- Multi-user auth, session cookies, anything with a database. That's
- Track 2.
-
-### Docker GPU image
-
-**Status:** `Dockerfile` exists (CPU-only). `Dockerfile.gpu`
-mentioned in CHANGELOG as planned.
-
-**Why it matters for release:** a single-file CUDA deployment story
-reduces "can I run this on our lab server?" from a 45-minute dance
-with conda and CUDA versions to one `docker run`. For Linux GPU
-users this is the friction difference between trying the project and
-bouncing.
-
-**Scope:**
-
-- Write `Dockerfile.gpu` on top of `nvidia/cuda:12.x-runtime-ubuntu22.04`
- (pick the version TF 2.18 actually supports — check TensorFlow's
- tested-build compatibility matrix, not just "latest").
-- Multi-stage: build stage has `uv` and builds the venv; final stage
- just copies the venv and sets entrypoints.
-- Add a `docker-build.yml` CI workflow that builds both images on
- every push to main and publishes as `ghcr.io/neuwirth/neuropose:cpu`
- and `:gpu` (or wherever the project ends up hosted).
-- Document in `docs/deployment.md` with a `docker run --gpus all`
- example.
-
-**Non-scope:**
-
-- A `tensorflow-metal` Docker image. Mac can't virtualize Metal, so
- there's no point.
-
-### Dependency freshness automation
-
-**Status:** No Dependabot, no Renovate. Everything floats until
-somebody notices. The recent TF cap tightening (`<2.19`) was caught
-manually because a user happened to ask; a scheduled bot would have
-flagged it weeks earlier.
-
-**Why it matters for release:** security CVEs on transitive
-dependencies land every few weeks. Without automation, they get
-discovered by a downstream user trying to install into an audited
-environment. With automation, they become a PR you either merge or
-explicitly decline.
-
-**Scope:**
-
-- Add `.github/dependabot.yml` with groups: `python-prod`,
- `python-dev`, `github-actions`. Weekly schedule. Ignore `tensorflow`
- updates until manually cleared (the `tensorflow-metal` constraint
- means auto-bumping TF is destructive).
-- Alternative: Renovate via `renovate.json`. Renovate has better
- grouping and scheduling; Dependabot is simpler and needs no setup
- on GitHub. For an open-source Brown-lab project, Dependabot is
- enough.
-- Add `uv lock --upgrade-package <package>` to the dev playbook in
- `docs/development.md` so PR authors know how to re-lock.
-
-### Release workflow
-
-**Status:** `[project.scripts]` is wired for `pip install`, but no
-tag-triggered publishing pipeline. `.github/workflows/docs.yml`
-uploads the built docs as a 14-day artifact, not to Pages.
-
-**Why it matters for release:** "release" without a repeatable
-publishing flow is a synonym for "someone runs `hatch build` on
-their laptop at 11pm before the paper deadline, once." That is not a
-release.
-
-**Scope:**
-
-- `.github/workflows/release.yml` triggered on version tags
- (`v[0-9]+.[0-9]+.[0-9]+`). Steps: check version matches
- `__version__`; build with `hatch build`; publish to PyPI via
- trusted publisher (no long-lived token); create GitHub release with
- changelog excerpt.
-- Flip `docs.yml` to deploy the `site/` output to GitHub Pages on
- every push to `main` once the repo is public. Pin the Pages URL in
- the README and in `site_url` in `mkdocs.yml` (already points at
- `levineuwirth.github.io`, but verify).
-- Sign tags with GPG; document the key fingerprint in `SECURITY.md`
- (which does not yet exist; create it).
-- Consider wiring sigstore signing at the same time — see Track 2
- supply-chain section. Free after the initial setup and buys
- everything Track 2 would want without committing to the rest of
- that track.
-
-**Open question:** do we publish under `neuropose`, `brown-neuropose`,
-or something else on PyPI? Whichever name we choose, claim it before
-the paper drops; waiting risks losing it to a name-squatter.
-
-### Error-path test coverage expansion
-
-**Status:** Happy paths and a handful of input-validation errors
-covered. Not covered: disk full mid-write, corrupt video mid-decode,
-OOM during inference, fcntl.flock on NFS (no-op on some kernels),
-truncated zip archives, permission denied on data_dir subdirectories.
-
-**Why it matters for release:** shipping a tool where "happy path
-works" is different from shipping a tool where "when it fails, it
-fails predictably." For a clinical research pipeline where a crash
-mid-job quarantines valuable recording data, fault tolerance is a
-feature.
-
-**Scope:**
-
-- Systematic pass: for each module, write a `test_<module>_failure_modes.py`
- enumerating the specific exception classes that can escape and the
- corresponding test case that triggers each one. Use `pytest.raises`
- with the exact expected exception class.
-- Hardest cases use fixtures that monkeypatch system calls
- (`os.write` raises OSError(ENOSPC), `cv2.VideoCapture.read` returns
- `False, None` partway through, `fcntl.flock` raises OSError(EBADF)).
-- Aim: every user-facing error message in the codebase has a test
- that proves it's reachable.
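-
-For shape, the disk-full case as a sketch (`save_job_results` and its
-import path are assumed names; the point is the monkeypatch pattern):
-
-```python
-import errno
-import os
-
-import pytest
-
-
-def test_save_job_results_disk_full(monkeypatch, tmp_path):
-    from neuropose.io import save_job_results  # assumed import path
-
-    def failing_write(fd, data):
-        raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))
-
-    monkeypatch.setattr(os, "write", failing_write)
-    with pytest.raises(OSError) as excinfo:
-        save_job_results(tmp_path / "job", {"poses3d": []})
-    assert excinfo.value.errno == errno.ENOSPC
-```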
-
-**Non-scope:**
-
-- Chaos-engineering frameworks. `monkeypatch` is enough.
-- Covering unrecoverable errors like SIGKILL of the daemon mid-frame.
- That's the recovery-on-startup test, which already exists.
-
----
-
-## Track 2: Clinical platform (contingent)
-
-Track 2 is everything beyond the open-source research tool —
-multi-tenancy, audit logging, HTTP/API layer, clinician UI,
-clinical-system integrations, the works. None of it is sequenced
-with Phases 0–2; all of it is gated on specific triggers that don't
-exist yet.
-
-### Triggers to activate Track 2
-
-Do not start Track 2 work until at least one of the following is
-true:
-
-1. **External demand.** Another clinical group has asked for a
- deployment they can run independently. Not a casual "interesting
- project" — a specific ask with a specific cohort and a specific
- timeline.
-2. **Multi-site ambition.** The Shu Lab decides to run NeuroPose
- across more than one site within Brown-affiliated clinical
- systems, and the single-host assumption stops working.
-3. **Funding mandate.** A grant or contract specifies outputs that
- the Phase 0-1-2 deliverables cannot meet (e.g. "produce a
- HIPAA-compliant deployment," "integrate with the EHR").
-4. **Publication traction.** Papers A and C get engagement that
- translates into demand for a hosted version. Clinical-methods
- papers occasionally do. If enough unsolicited inquiries land,
- Track 2 becomes worth the investment.
-
-Before at least one of these triggers: everything below is
-background thinking, not planned work. *Do not refactor Phase 0 or
-Phase 2 code to make Track 2 easier.* Every such refactor is a bet
-on a future that may not arrive.
-
-### Multi-tenancy and identity
-
-**What it would require:**
-
-- A concept of "user" distinct from "OS user." Today `Settings.data_dir`
- is one directory per OS user; multi-tenancy means one `data_dir`
- serving many logical tenants with enforced isolation.
-- Per-tenant namespacing in `in/`, `out/`, `failed/`, and
- `status.json`. Cleanest is one subdirectory per tenant with the
- same four-directory layout; the daemon's discovery logic becomes a
- two-level scan.
-- Authentication on the control plane. Passing tenant identity as a
- command-line arg is fine for a research prototype; a real
- deployment needs OAuth/OIDC or SAML with the institution's IdP
- (Brown CAS, epic Auth, whatever the target site uses).
-- Authorization model: at minimum, "tenant A cannot see tenant B's
- jobs." For clinical deployments, probably also role-based (clinician
- / PI / admin / auditor).
-
-**Cheapest path forward if a trigger fires:** fork the data-directory
-layout into `$data_dir/<tenant>/{in,out,failed,status.json}`,
-teach the daemon to iterate tenants in its poll loop, add a
-`--tenant` flag to the CLI. That's enough for an invitation-only
-deployment where tenants are identified by opaque string and issued
-out-of-band.
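-
-A sketch of the two-level discovery scan under that layout (names
-illustrative):
-
-```python
-from pathlib import Path
-from typing import Iterator
-
-
-def discover_jobs(data_dir: Path) -> Iterator[tuple[str, Path]]:
-    # $data_dir/<tenant>/in/<job>; tenants are opaque directory names.
-    for tenant_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
-        in_dir = tenant_dir / "in"
-        if not in_dir.is_dir():
-            continue
-        for job_dir in sorted(p for p in in_dir.iterdir() if p.is_dir()):
-            yield tenant_dir.name, job_dir
-```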
-
-**Expensive path:** anything involving an identity provider. Don't
-go there without a real operator committing to the deployment.
-
-### Audit logging and compliance posture
-
-**What it would require:**
-
-- Append-only log of every data access, write, and configuration
- change, with actor identity and timestamp. Separate from the
- application log (which rotates).
-- Logs streamed to a write-once sink (S3 with object-lock,
- immutable journal) so a compromised host can't rewrite the
- trail.
-- Legal review: what exactly does HIPAA require of this tool? What
- about institutional IRB? The answer will differ across sites and
- the project cannot prescribe it — but the *capability* to generate
- the required logs needs to be built in.
-- Retention policy wired to the audit log, not just application
- state. Pruning job results is different from pruning audit records.
-
-**Technical prerequisite:** structured logging from Phase 2 (which
-is a cheap add and is scheduled anyway). Without JSON-per-line logs,
-audit extraction is a grep-and-pray regex problem.
-
-### HTTP/API layer
-
-**What it would require:**
-
-- Today the control plane is "write files to `in/`." For a
- non-filesystem-native consumer (a hosted web UI, a batch scheduler,
- a Jupyter kernel in a different container), an HTTP API is the
- right abstraction.
-- FastAPI or Litestar on top of the existing ingest/interfacer/io
- modules. The daemon becomes a long-running process that serves
- requests *and* processes the input directory; or the daemon stays
- headless and the HTTP layer is a separate process talking via the
- same filesystem contract.
-- OpenAPI schema published as part of the release so client code can
- be generated.
-
-**Non-obvious pitfall:** the daemon's fcntl-based single-instance
-lock assumes one writer. If the HTTP layer is a separate process, it
-needs to go through the same ingest API, not directly into `in/`.
-That's an easy discipline to establish if designed in from day one,
-a painful refactor later.
-
-**Cheap Phase 0/2 precaution:** keep `neuropose.ingest` and
-`neuropose.interfacer` API-stable as Python modules. If a future
-HTTP layer imports them, we don't want to break the import.
-
-### Clinician-facing UI
-
-**What it would require:**
-
-- More than the `neuropose serve` dashboard — an actual web
- application with clinician-facing views: patient list, session
- list, session-level pose visualization, comparison against
- reference motion, exportable reports.
-- Probably React + TypeScript on the frontend, consuming the HTTP
- API above. Backend-rendered templates would be faster to build but
- a worse fit for the per-session interaction model clinicians
- expect.
-- WebGL or Three.js for 3D pose playback. The `neuropose.visualize`
- module is a matplotlib-based still-frame tool; rebuilding it for
- interactive 3D is a weeks-to-months project on its own.
-- Accessibility: clinician environments include keyboard-only users,
- users on institutional IE holdovers (yes, still), users with
- screen readers. A research-grade UI ignores this; a clinical-grade
- one cannot.
-
-**Scope is enormous.** This is the single largest piece of Track 2
-and would likely dwarf all other Track 2 work combined. Would not
-start without dedicated frontend engineering effort.
-
-### Horizontal scaling
-
-**What it would require:**
-
-- A message broker (Redis Streams, RabbitMQ, or NATS) in place of the
- filesystem poll. Each job becomes a broker message; multiple
- worker processes consume and process in parallel.
-- Shared storage for inputs and outputs (S3, MinIO, NFS). The
- "job_name is a directory" contract generalizes to "job_name is an
- object prefix."
-- Per-worker GPU affinity for the multi-GPU case; worker auto-sizing
- based on queue depth.
-- Distributed lock for the leader-only work (status file writes,
- retention enforcement).
-
-**Upgrade path that minimizes pain:** the current single-process
-daemon is equivalent to the "one worker" case of a horizontal
-deployment. If the job object in `neuropose.io` stays the source of
-truth (not the filesystem layout), the transition is backend-swap,
-not architectural surgery. Keep that option open by treating the
-filesystem as an implementation detail of `Interfacer`, not a public
-contract.
-
-### Backup, replication, and data durability
-
-**What it would require:**
-
-- Outputs (`out/<job_name>/results.json`) currently live on one disk on
- one host. For clinical data this is insufficient durability.
-- Replication target: another host (hot standby), object storage
- (warm archive), or both. The `out/` directory is the canonical
- store; replicating it periodically is a scriptable cron job today.
-- Proper replication: as writes happen, not as a cron. Either a
- daemon-side hook that PUTs to S3 immediately after each
- `save_job_results`, or a sidecar process watching the filesystem
- with `inotify`/`fswatch`.
-- Restore story: how do we restore `out/` from backup without
- breaking `status.json` (which refers to job names by convention)?
- Test this annually.
-
-**Minimum viable backup for Phase 2:** add a `scripts/backup.sh`
-that rsyncs `$data_dir/out/` to a configurable destination. Not a
-feature; a paving-the-path-for-operators artifact.
-
-### Clinical-system integrations
-
-**What it would require:**
-
-- **DICOM** if videos are stored as DICOM instances rather than
- MP4. Clinical motion-analysis devices increasingly output DICOM
- video; reading DICOM means `pydicom` + some decoding logic.
-- **FHIR** for patient metadata. If NeuroPose is to accept a
- patient ID and attach it to a session, that ID probably comes
- from a FHIR Patient resource. Means speaking FHIR to the hospital's
- FHIR endpoint (Epic, Cerner).
-- **REDCap** integration for clinical-research cohorts (the Brown
- ecosystem uses it heavily). An export script that pulls subject
- metadata from a REDCap project and lays it into the ingest
- directory is cheap and valuable.
-
-**Order of likely need:** REDCap first (easy, valuable, Brown-local),
-then DICOM (depends on what the recording device outputs), then
-FHIR (only if we're pulling from an EHR, which we probably aren't
-for research).
-
-### Deterministic inference mode
-
-**What it would require:**
-
-- Phase 0's `Provenance` object already captures model SHA, TF
- version, NumPy version, and a seed field. The missing piece for
- strict reproducibility is forcing TensorFlow itself to behave
- deterministically —
- `tf.config.experimental.enable_op_determinism()` plus seeding all
- of `random`, `numpy.random`, and `tf.random`.
-- A `deterministic: bool = False` setting on `Settings` that flips
- the above. Default off, because deterministic mode costs a
- meaningful fraction of throughput on GPUs and isn't free on CPUs
- either. Clinical deployments would turn it on; benchmark runs
- would turn it off.
-- A `Provenance.deterministic` boolean field is already in the Phase
- 0 scope; this item closes the loop by actually making that
- boolean mean something.
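-
-The flip itself is small; a sketch, assuming the `Settings` plumbing
-exists:
-
-```python
-import random
-
-import numpy as np
-import tensorflow as tf
-
-
-def enable_determinism(seed: int) -> None:
-    """Best-effort deterministic inference; costs throughput on GPU."""
-    tf.config.experimental.enable_op_determinism()  # TF 2.9+
-    random.seed(seed)
-    np.random.seed(seed)
-    tf.random.set_seed(seed)
-```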
-
-**Cheap Phase 2 precaution:** wire the setting in Phase 2 even if we
-don't flip it on by default. Future Track 2 deployments can flip it
-without a code change.
-
-### Observability and SLOs
-
-**What it would require:**
-
-- Prometheus metrics endpoint (separate port from the monitor, no
- auth needed on metrics, loopback or behind a scraper only).
-- Counters: jobs_processed, jobs_failed, frames_processed, bytes_read,
- bytes_written. Histograms: per-frame latency, per-job latency,
- per-video latency. Gauges: queue depth, active job count.
-- Tracing: OpenTelemetry instrumentation on job_process,
- detect_poses, save_job_results. Again, the interesting spans are
- the long ones, so trace-sampling at 100% is usually fine until
- throughput matters.
-- Defined SLOs: "99% of jobs complete within 10× video duration,"
- "95% of monitor requests return in under 100 ms," etc.
- SLO definitions go into a `docs/slos.md`; burn-rate alerting is
- the operational half.
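-
-For scale, the counter half would be small, e.g. with
-`prometheus_client` (an assumption; nothing is wired today):
-
-```python
-from prometheus_client import Counter, Gauge, Histogram, start_http_server
-
-JOBS_PROCESSED = Counter("neuropose_jobs_processed_total", "Jobs completed")
-JOBS_FAILED = Counter("neuropose_jobs_failed_total", "Jobs quarantined")
-QUEUE_DEPTH = Gauge("neuropose_queue_depth", "Jobs waiting in in/")
-JOB_LATENCY = Histogram("neuropose_job_latency_seconds", "Wall time per job")
-
-
-def start_metrics_endpoint(port: int = 9108) -> None:
-    # Separate port from the monitor; keep it loopback or scraper-only.
-    start_http_server(port)
-```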
-
-**Overriding dependency:** none of this is useful without
-Track 2 demand. A single-user research Mac doesn't have SLOs.
-
-### Supply-chain attestation and signed releases
-
-**What it would require:**
-
-- SBOM generation on every release (CycloneDX or SPDX format,
- attached to the GitHub release and published alongside the wheel).
-- Signed releases: sigstore / cosign signatures on the wheel, the
- Docker images, and the source tarball. GitHub's OIDC +
- sigstore makes this a ten-line workflow once. For a clinical tool,
- a reviewer being able to verify "this wheel is the one GitHub
- Actions produced from this commit" is non-negotiable.
-- Reproducible builds: same source → same wheel hash. Python wheels
- are usually reproducible with `SOURCE_DATE_EPOCH` set and `.pyc`
- exclusion; document the exact command.
-- Provenance attestations (SLSA level 2 or 3) for the CI pipeline.
- GitHub's `actions/attest-build-provenance` action does this.
-
-**Cheapest Phase 2 precaution:** wire sigstore signing into the
-release workflow when it's first built (see Phase 2 release workflow
-section). Free after the initial setup.
-
-### Deployment orchestration
-
-**What it would require:**
-
-- Kubernetes manifests (Helm chart, probably). Pod specs for the
- daemon, the monitor, the HTTP API. Separate deployments so they
- can scale independently.
-- Terraform or Pulumi for the underlying infrastructure: GPU
- node pool, object storage, IAM, TLS termination. Site-dependent;
- Brown runs primarily on-prem with some AWS — the IaC would need
- to target both.
-- Secrets management: Vault, AWS Secrets Manager, or K8s
- Secrets + External Secrets Operator. The monitor token, the
- broker credentials, the object-storage keys all need to stop being
- env vars in a `.service` file.
-
-**Strong recommendation:** do not write any of this until there is
-a specific deployment with specific operators. Generic K8s manifests
-written without a target are a solution in search of a problem, and
-they age fast.
-
----
-
-## Decisions to not prematurely foreclose
-
-A short list of choices we should avoid making in Phase 0 or Phase 2
-that would make Track 2 more expensive later:
-
-1. **Keep `neuropose.ingest` and `neuropose.interfacer` API-stable
- as Python modules.** A future HTTP layer should be able to import
- them. Avoid adding `@staticmethod` decorators that hide internal
- state; avoid coupling to global config.
-2. **Keep the filesystem layout reversible.** Anything in
- `$data_dir` that is not a user artifact should be treated as
- internal. If Track 2 wants to replace the filesystem with an
- object store, the daemon's only file I/O should be via
- `neuropose.io` helpers — no raw opens scattered through the code.
-3. **Keep `VideoPredictions.provenance` extensible.** The Phase 0
- `Provenance` model should be a pydantic model so fields can be
- added backward-compatibly. Don't pack provenance into free-form
- strings or nested dicts that require bespoke parsing.
-4. **Keep the CLI subcommands orthogonal.** Do not add subcommands
- that wrap multiple other subcommands for convenience; that
- creates API shape we'd regret if the right composition layer
- later is HTTP, not shell.
-5. **Keep model loading behind `neuropose._model`.** A future
- self-hosted model registry, signed-artifact verification, or
- multi-model switching should be a change in one file, not a
- refactor across the estimator.
-6. **Keep `Settings` the single source of truth.** No `os.environ`
- reads outside pydantic-settings; no sprinkled `Path.home()`
- calls. Track 2 almost certainly overrides configuration from
- a secret store, and if that override has one place to hook in,
- it's easy.
-7. **Keep status-file schema owned by pydantic, not hand-written
- JSON.** Track 2 multi-tenancy means indexing into the status
- file by tenant; a pydantic model refactor is cheap, a
- hand-written dict refactor is not.
-8. **Keep the `AnalysisConfig` shape additive.** The Phase 0 YAML
- schema will evolve through Phase 1 as Paper C's experiments
- surface needs. Additions are free (new optional fields);
- renames and removals invalidate prior experiments. Pydantic's
- `extra="forbid"` catches typos at parse time while still
- allowing additive extension.
-
-These are cheap-now / expensive-later items. Every other Track 2
-decision can wait for a real trigger.