# NeuroPose Technical Ideation Notes
A living engineering roadmap, parallel to `RESEARCH.md`. Where
`RESEARCH.md` captures open methodological questions (DTW, skeleton
choice, hosting the model), this document captures open *engineering*
questions — release readiness, operability, scaling — and the paths
they could take.
This is **not** user-facing documentation. Items here are *candidates*
for future work, and inclusion does not imply commitment.
## How to use this document
- Add a section when you start thinking about a new area of technical
investment.
- Each section should end with a **Scope**, **Sketch**, or **Open
questions** block so it's obvious to a future you (or a new
contributor) what the concrete next move would be.
- When an item in here is decided and implemented, move it to the
relevant place in `docs/` or in the code itself, and leave a short
pointer behind (*See `docs/deployment.md` for the resolved design.*).
- The audience is anyone maintaining the codebase — Levi, David,
Praneeth, Dr. Shu, and whoever comes after us. Assume competence in
Python and systems work; don't assume familiarity with our specific
tooling choices.
## Three phases, then a contingent track
There are four distinct technical objectives, ordered by timeline and
by what each enables next. The sequencing is deliberate: each phase
unblocks the next, and doing them in any other order either publishes
Paper C on top of a pipeline its own design notes disavow, or delays
the open-source release past the window where the accompanying paper
is still salient.
1. **Phase 0 — C-enabling pipeline work.** A targeted subset of
engineering work that has to land *before* Paper C can start. The
DTW defaults shipped in 0.1 are explicitly a "mechanical port, not
a methodological choice" (see `RESEARCH.md` §1); running the
clinical validation study on them would mean publishing results
from a pipeline the accompanying design notes explicitly criticize.
Phase 0 fixes the analyzer's methodological foundations (Procrustes
preprocessing, cycle segmentation, joint-angle DTW representation),
locks in the reproducibility surface (`Provenance` subobject,
YAML-configurable analysis pipeline), and sets up schema migration
so data generated during Phase 1 survives the long write-up.
**Near-term, well-scoped, weeks of work.**
2. **Phase 1 — Paper C: clinical validation study.** The planned
clinical-methods paper: cycle-aware joint-angle DTW for clinical
gait similarity, validated against MoCap ground truth and/or
clinician ratings. Gated on MoCap data access via Dr. Shu. This is
research work, not engineering work — this document describes the
engineering scaffolding *around* it, not the paper itself. Phase 2
work can happen in the background during this phase as ideal filler
for research-burnout cycles. **Months; timeline driven by data
access and experimental design.**
3. **Phase 2 — Coordinated open-source release + Paper A.** The
engineering-paper companion (A) describing the tech stack, plus
the tagged 0.1 release: PyPI publication, docs deployment, Docker
images, CI matrix, supervision artifacts, doctor preflight, all
the operational items that make the tool credible to external
users. Timed to arrive *with or slightly before* Paper C's
submission, producing a paper-plus-tool bundle that reviewers can
actually run. **Weeks of work, timing driven by Paper C's
submission window.**
4. **Track 2 — Clinical platform (contingent).** Everything beyond
the open-source research tool — multi-tenancy, audit logging,
HTTP/API layer, clinician UI, clinical-system integrations. Not
sequenced; activates only if specific triggers fire (external
demand, multi-site ambition, funding mandate, publication
traction). Most of this is background thinking, not planned work.
The value of keeping it in this document is so that Phase 0 and
Phase 2 decisions don't accidentally foreclose Track 2 options.
Phases 0 → 1 → 2 form a near-term sequence that culminates in a
paper-plus-release bundle. Track 2 sits outside that sequence and
does not gate any of it.
## Contents
- [Phase 0: C-enabling pipeline work](#phase-0-c-enabling-pipeline-work)
- [Procrustes preprocessing](#procrustes-preprocessing)
- [Gait cycle segmentation](#gait-cycle-segmentation)
- [Joint-angle DTW representation](#joint-angle-dtw-representation)
- [Provenance subobject](#provenance-subobject)
- [YAML-configurable analysis pipeline](#yaml-configurable-analysis-pipeline)
- [Schema migration for VideoPredictions](#schema-migration-for-videopredictions)
- [Phase 1: Clinical validation study (Paper C)](#phase-1-clinical-validation-study-paper-c)
- [Phase 2: Coordinated open-source release + Paper A](#phase-2-coordinated-open-source-release--paper-a)
- [Release definition](#release-definition)
- [Apple Silicon CI matrix](#apple-silicon-ci-matrix)
- [Mac hardware validation pass](#mac-hardware-validation-pass)
- [Retention and pruning](#retention-and-pruning)
- [neuropose doctor preflight](#neuropose-doctor-preflight)
- [Process supervision artifacts](#process-supervision-artifacts)
- [Structured logging option](#structured-logging-option)
- [Monitor authentication](#monitor-authentication)
- [Docker GPU image](#docker-gpu-image)
- [Dependency freshness automation](#dependency-freshness-automation)
- [Release workflow](#release-workflow)
- [Error-path test coverage expansion](#error-path-test-coverage-expansion)
- [Track 2: Clinical platform (contingent)](#track-2-clinical-platform-contingent)
- [Triggers to activate Track 2](#triggers-to-activate-track-2)
- [Multi-tenancy and identity](#multi-tenancy-and-identity)
- [Audit logging and compliance posture](#audit-logging-and-compliance-posture)
- [HTTP/API layer](#httpapi-layer)
- [Clinician-facing UI](#clinician-facing-ui)
- [Horizontal scaling](#horizontal-scaling)
- [Backup, replication, and data durability](#backup-replication-and-data-durability)
- [Clinical-system integrations](#clinical-system-integrations)
- [Deterministic inference mode](#deterministic-inference-mode)
- [Observability and SLOs](#observability-and-slos)
- [Supply-chain attestation and signed releases](#supply-chain-attestation-and-signed-releases)
- [Deployment orchestration](#deployment-orchestration)
- [Decisions to not prematurely foreclose](#decisions-to-not-prematurely-foreclose)
---
## Phase 0: C-enabling pipeline work
The six items below are prerequisites for Paper C. Until they are
landed, every analysis C would produce would be running on defaults
that `RESEARCH.md` §1 explicitly flags as provisional. Ship these
first, in any order that suits the implementer's cadence, and the
rest of the project can pick up with confidence that Phase 1 results
are trustworthy.
### Procrustes preprocessing
**Status:** Not implemented. `neuropose.analyzer.features` ships
`extract_joint_angles` and feature-statistics helpers; no alignment
step exists between pose sequences.
**Why it matters for Paper C:** without alignment, DTW distance is
translation- and orientation-dependent. Two recordings of the same
subject from different camera positions produce different distances,
which is almost never what a clinician wants. Paper C's methods
section would need to apologize for this in print; cheaper to fix the
method than to defend it.
**Scope:**
- Add `procrustes_align(a: np.ndarray, b: np.ndarray, *, mode:
Literal["per_frame", "per_sequence"]) -> tuple[np.ndarray,
np.ndarray, AlignmentDiagnostics]` to `neuropose.analyzer.features`.
Implements the Kabsch algorithm (closed-form optimal rigid
transform). Per-frame aligns each frame of A to the corresponding
frame of B independently; per-sequence computes one transform over
the whole sequence. Both are useful — per-frame for fine-grained
matching, per-sequence for preserving within-trial dynamics.
- Return aligned arrays plus an `AlignmentDiagnostics` dataclass with
the fitted rotation magnitude and translation magnitude so
downstream code can flag suspiciously large transforms (usually a
sign of upstream annotation error).
- Expose as an opt-in `align: Literal["none", "procrustes_per_frame",
"procrustes_per_sequence"] = "none"` parameter on every DTW entry
point in `neuropose.analyzer.dtw`. Default `none` preserves current
behavior; Paper C's pipeline sets it to `procrustes_per_sequence`.
- Unit tests: construct a known rotation + translation between two
synthetic skeletons, verify alignment recovers it to within
floating-point precision; verify alignment of a sequence with its
own translated copy produces zero residual.
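As a concreteness check on the scope above, a minimal sketch of the
per-sequence path. The Kabsch core is standard; the array shapes and
the `AlignmentDiagnostics` fields shown here are assumptions, not the
final API:
```python
# Sketch only. Assumes pose sequences are (frames, joints, 3) float
# arrays; the diagnostics fields are illustrative.
from dataclasses import dataclass

import numpy as np


@dataclass
class AlignmentDiagnostics:
    rotation_deg: float   # magnitude of the fitted rotation
    translation: float    # magnitude of the fitted translation


def kabsch(a: np.ndarray, b: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Closed-form rigid (R, t) minimizing ||(a @ R.T + t) - b||, rows = points."""
    ca, cb = a.mean(axis=0), b.mean(axis=0)
    h = (a - ca).T @ (b - cb)
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T  # force a proper rotation (no reflection)
    return r, cb - ca @ r.T


def procrustes_align_per_sequence(a: np.ndarray, b: np.ndarray):
    """Fit one rigid transform over all frames of a and apply it."""
    pts_a, pts_b = a.reshape(-1, 3), b.reshape(-1, 3)
    r, t = kabsch(pts_a, pts_b)
    aligned = (pts_a @ r.T + t).reshape(a.shape)
    angle = np.degrees(np.arccos(np.clip((np.trace(r) - 1) / 2, -1.0, 1.0)))
    return aligned, b, AlignmentDiagnostics(float(angle), float(np.linalg.norm(t)))
```
The per-frame mode is the same core applied frame by frame; the
`scale` flag from the open question below would divide out the
centered RMS norms before fitting.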
**Non-scope:**
- Non-rigid alignment (thin-plate splines, learned registration). Not
needed for skeleton-level comparison and would be a research
contribution on its own.
**Open question:** should alignment also include optional scaling
(scaled-Procrustes / full Procrustes)? For cross-subject comparison
it almost certainly should. Default to scale-preserving and add a
`scale: bool = False` flag; Paper C can flip it on for cross-subject
figures.
### Gait cycle segmentation
**Status:** `segment_by_peaks` in `neuropose.analyzer.segment`
performs generic valley-to-valley segmentation on a supplied 1D
signal. There is no gait-specific wrapper that knows to look at the
heel's vertical coordinate.
**Why it matters for Paper C:** clinical gait analysis wants to
compare *the 4th heel-strike of trial A* to *the 4th heel-strike of
trial B*, not *frame 120 of A vs frame 120 of B*. Per-cycle DTW is
the standard approach in the biomechanics literature (Sadeghi et al.
2000 and descendants); running full-trial DTW on gait is a choice
reviewers of Paper C would correctly flag as methodologically weak.
**Scope:**
- New `segment_gait_cycles(predictions: VideoPredictions, *, joint:
str = "rhee", axis: Literal["x", "y", "z"] = "y", min_cycle_seconds:
float = 0.4) -> Segmentation` in `neuropose.analyzer.segment`.
- Under the hood: extract the specified joint's coordinate along the
specified axis, apply `segment_by_peaks` with appropriate distance
and prominence thresholds (derived from `min_cycle_seconds` via
`predictions.metadata.fps`), return the resulting `Segmentation`
(the existing `neuropose.io.Segmentation` type) so downstream
tooling picks it up unchanged.
- Two-sided detection: run the same detection on the opposite heel
and return *both* per-side segmentations under named keys
(`left_heel_strikes`, `right_heel_strikes`). Clinical users will
want both.
- Allow the reference joint and axis to be configurable so trials
recorded with a different camera orientation (lateral vs frontal
vs oblique) can still be segmented without a code change.
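A standalone sketch of the detection core, assuming `scipy` is
available. The real implementation would route through the existing
`segment_by_peaks` rather than calling `find_peaks` directly, and the
prominence threshold here is a placeholder heuristic:
```python
# Sketch only: heel strikes as local minima of the heel's vertical
# coordinate.
import numpy as np
from scipy.signal import find_peaks


def detect_heel_strikes(heel_height: np.ndarray, fps: float,
                        min_cycle_seconds: float = 0.4) -> np.ndarray:
    """Return frame indices of detected heel strikes."""
    min_distance = max(1, int(round(min_cycle_seconds * fps)))
    # Valleys of heel height are peaks of the negated signal; the
    # distance floor enforces the minimum plausible cycle length.
    strikes, _ = find_peaks(-heel_height, distance=min_distance,
                            prominence=0.25 * float(np.std(heel_height)))
    return strikes


# Consecutive strikes bound one gait cycle:
# cycles = list(zip(strikes[:-1], strikes[1:]))
```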
**Non-scope:**
- HMM-based cycle detection, learned cycle detectors. Peak detection
on vertical coordinate is standard, well-understood, and the
method the biomechanics literature expects to see.
- Handling pathological gaits where heel-strikes are absent
(shuffling, walker-assisted). The function should degrade
gracefully (return a `Segmentation` with an empty list, not raise),
and Paper C's data-quality filtering handles the rest.
**Open question:** should the function also emit a "confidence" per
cycle (prominence of the detected peak, regularity of spacing) that
Paper C can use to filter out low-quality detections? Cheap to add,
useful downstream. Recommend yes.
### Joint-angle DTW representation
**Status:** `dtw_all`, `dtw_per_joint`, and `dtw_relation` operate on
raw 3D coordinates or joint-pair displacements. `extract_joint_angles`
produces per-frame angle sequences but is not wired as a DTW input.
**Why it matters for Paper C:** angle-space DTW is translation- and
rotation-invariant by construction, scale-invariant if normalized,
and directly interpretable in clinical terms ("knee flexion angle
during swing phase"). Paper C's headline figures almost certainly
use angle-space distances; raw coordinates would draw the obvious
reviewer question of why we aren't comparing the thing clinicians
actually measure.
**Scope:**
- Add `representation: Literal["coords", "angles", "relation"] =
"coords"` to every DTW entry point. The `coords` default preserves
existing behavior; `angles` runs `extract_joint_angles` on each
input first; `relation` is the existing `dtw_relation` path
expressed as a representation choice rather than a separate
function (leaving the `dtw_relation` name as a convenience wrapper
if preferred).
- Degenerate-vector handling: `extract_joint_angles` returns NaN for
degenerate (zero-length) vectors. The DTW path needs to decide how
to handle NaN — skip-and-interpolate, drop, or propagate to the
distance. Propagation is safest (makes the problem visible);
interpolation is what clinical users probably want day-to-day.
Default to propagation and expose `nan_policy: Literal["propagate",
"interpolate", "drop"]` for experimentation.
- Tests: synthetic pair with known angular difference, assert DTW in
angle-space recovers it independent of global rotation applied to
the input.
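A sketch of the `nan_policy` handling above, applied to a
`(frames, n_angles)` angle sequence before it reaches the DTW
backend; the function name is illustrative:
```python
import numpy as np


def apply_nan_policy(angles: np.ndarray, nan_policy: str = "propagate") -> np.ndarray:
    """Pre-DTW NaN handling for a (frames, n_angles) sequence. Sketch only."""
    if nan_policy == "propagate":
        return angles  # NaNs flow into the distance and surface loudly
    if nan_policy == "drop":
        return angles[~np.isnan(angles).any(axis=1)]  # shortens the sequence
    if nan_policy == "interpolate":
        out, frames = angles.copy(), np.arange(len(angles))
        for j in range(out.shape[1]):
            bad = np.isnan(out[:, j])
            if bad.any() and not bad.all():
                out[bad, j] = np.interp(frames[bad], frames[~bad], out[~bad, j])
        return out
    raise ValueError(f"unknown nan_policy: {nan_policy!r}")
```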
**Non-scope:**
- Quaternion or SO(3) rotation-space DTW. Interesting but requires a
rotation parameterization the current skeleton output does not
produce.
- Mixed-representation (position + angle concatenated, learned
embeddings). These are experiments Paper C might run; they don't
belong in Phase 0 infrastructure.
### Provenance subobject
**Status:** `PerformanceMetrics` captures `tensorflow_version`,
`active_device`, and `tensorflow_metal_active`. Model SHA is not
computed or propagated. `numpy_version` and `neuropose_version` are
not recorded. No first-class `Provenance` object.
**Why it matters for Paper C:** reproducibility is the first
question a reviewer asks of a clinical-methods paper. The answer
needs to be "same model artifact, same pipeline config, same
versions, same seeds" — and all four need to be recorded on every
`results.json` that underlies a paper figure. Not having this means
either manually tracking it in a lab notebook (fragile, won't
survive personnel turnover) or running every experiment through a
pinned Docker image (expensive, doesn't capture runtime
non-determinism). The subobject is the cheap right answer.
**Scope:**
- New `Provenance` pydantic model in `neuropose.io` with fields:
`model_sha256: str`, `model_filename: str`, `tensorflow_version:
str`, `tensorflow_metal_version: str | None`,
`numpy_version: str`, `neuropose_version: str`, `python_version:
str`, `seed: int | None`, `deterministic: bool`, `analysis_config:
dict | None` (the YAML of the run if the pipeline was invoked via
`neuropose analyze --config`).
- Optional `provenance: Provenance | None = None` field on
`VideoPredictions`, `JobResults`, and `BenchmarkResult`. None-valued
on legacy files (enabled by schema migration — see below), populated
on every new write.
- `_model.py` hashes the downloaded tarball on first load (after the
existing SHA verification — the two checks use the same hash so
compute is amortized) and exposes the hash via a
`get_model_sha256()` method on the `Estimator`. `Interfacer._run_job_inner`
constructs the `Provenance` and attaches it to the output.
- Unit test: serialize → JSON → deserialize round-trip identity;
assert `model_sha256` matches the SHA recorded in
`neuropose._model`.
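The field list above, transcribed into pydantic as a sketch; the
defaults are assumptions, and the real model may grow validators:
```python
from pydantic import BaseModel, ConfigDict


class Provenance(BaseModel):
    # pydantic v2 reserves the "model_" prefix by default; relax it so
    # model_sha256 / model_filename are allowed as field names.
    model_config = ConfigDict(protected_namespaces=())

    model_sha256: str
    model_filename: str
    tensorflow_version: str
    tensorflow_metal_version: str | None = None
    numpy_version: str
    neuropose_version: str
    python_version: str
    seed: int | None = None
    deterministic: bool = False
    analysis_config: dict | None = None
```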
**Non-scope:**
- Cryptographic signatures on results.json. That's Phase 2 (sigstore
on release artifacts) or Track 2 (per-output signing) territory,
not Phase 0.
- Provenance on arbitrary intermediate products (numpy arrays, DTW
distance matrices). Top-level JSONs cover Paper C's needs; richer
intermediates can inherit from a hand-off if needed.
**Open question:** does Paper C need *per-frame* provenance (which
frame was processed with which configuration) or just per-job
provenance? Per-job is enough for reproducibility; per-frame is only
useful if we want to mix configurations within a single job, which
has no current use case.
### YAML-configurable analysis pipeline
**Status:** `neuropose.cli`'s `analyze` subcommand is a stub that
raises `NotImplementedError`. Analysis operations are called
individually from Python, or via CLI flags on `segment` and
`benchmark`. No unified representation of "a complete analysis run."
**Why it matters for Paper C:** the paper will run many experimental
configurations — alignment on/off, per-frame vs per-sequence, raw
coordinates vs joint angles, full-trial vs cycle-segmented DTW,
various distance metrics. Each experiment should be reproducible
from a single file that can be version-controlled, diffed, attached
to the `Provenance` object, and cited in the paper. A Python script
full of kwargs is the alternative, and it's exactly the alternative
the open-source community collectively decided against ten years ago.
This item also resolves the "`neuropose analyze`: ship or remove"
question that was previously open: we are shipping `analyze`,
specifically in a YAML-driven form. The stub that currently exists
becomes the real command in Phase 0.
**Scope:**
- `AnalysisConfig` pydantic model in `neuropose.analyzer` capturing
the full pipeline: input source (predictions file path),
preprocessing (`align`, `normalize`, `segment`), per-segment
analysis (DTW backend, representation, distance function, extra
kwargs), output (figures, statistics, distance matrices).
- Parseable from YAML via pydantic; validated on parse so typos in
field names fail early with a clear error.
- `neuropose analyze --config experiment.yaml [--output
results_042.json]` runs the pipeline end-to-end. The config YAML
is serialized into the resulting `Provenance.analysis_config`, so
the output file is self-describing.
- Ship three or four *example* configs under `examples/analysis/`
that exercise the full matrix of alignment × representation ×
segmentation choices Paper C will care about. Double as integration
tests.
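A hypothetical example config to make the shape concrete; every
field name below is provisional until `AnalysisConfig` lands:
```yaml
# examples/analysis/cycle_angles.yaml (hypothetical)
input:
  predictions: out/trial_042/results.json
preprocessing:
  align: procrustes_per_sequence
  segment:
    method: gait_cycles
    joint: rhee
    axis: y
    min_cycle_seconds: 0.4
analysis:
  backend: dtw_per_joint
  representation: angles
  nan_policy: propagate
output:
  distance_matrix: true
  statistics: true
  figures: [per_cycle_heatmap]
```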
**Non-scope:**
- A DAG / workflow engine (Snakemake, Nextflow). A flat config is
enough for Paper C's needs; reach for a DAG tool only when
experiments have genuine inter-stage dependencies, which analysis
of a single video does not.
- Parallel sweep execution. Run multiple configs via a shell loop
for now (`for cfg in examples/analysis/*.yaml; do neuropose
analyze --config "$cfg" --output "out/$(basename "$cfg" .yaml).json"; done`).
A real sweep orchestrator is Track 2.
**Open question:** should there be a `neuropose analyze compare
<config_a.yaml> <config_b.yaml>` subcommand that runs both and
emits a diff figure? Useful for Paper C but not a gating feature —
post-Phase-0 addition if the need is clear.
### Schema migration for VideoPredictions
**Status:** `VideoPredictions` gained `segmentations: dict[str,
Segmentation] = Field(default_factory=dict)` during recent work. Old
JSON files without the field still load (pydantic default-factories
back-fill), but this is accidental rather than designed-in.
**Why it matters for Paper C:** Paper C will produce analysis results
over the course of 6-12 months. During that window, Phase 0 work
itself will evolve — the `Provenance` object will gain fields, the
`AnalysisConfig` shape will stabilize, maybe the `Segmentation` schema
will extend. Without migration support, every schema change would
invalidate some portion of Paper C's already-generated data, forcing
either a freeze (drops velocity) or a full re-run (wastes compute).
Migration now is the cheap fix.
**Scope:**
- Add a `schema_version: int = 1` field to `VideoPredictions`,
`JobResults`, and `BenchmarkResult` (the three load-anywhere
top-level schemas).
- Write `migrate_video_predictions(payload: dict) -> dict` that
takes a raw JSON-loaded dict, dispatches on `schema_version`, and
returns a dict conformant with the current version. Default to 1
when missing (existing files).
- Wire it into `load_video_predictions()` so the migration runs
before pydantic validation. Log at INFO on migration so users see
when files are being upgraded.
- When writing, always write the current version.
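A sketch of the dispatch-on-integer shape; the v1-to-v2 body is
illustrative, since no v2 exists yet:
```python
CURRENT_SCHEMA_VERSION = 2  # illustrative; today's schemas are all v1


def migrate_video_predictions(payload: dict) -> dict:
    """Upgrade a raw JSON-loaded dict to the current schema version."""
    version = payload.get("schema_version", 1)  # missing field => v1
    if version == 1:
        # Example v1 -> v2 step: back-fill a field added after v1.
        payload.setdefault("provenance", None)
        payload["schema_version"] = version = 2
    if version != CURRENT_SCHEMA_VERSION:
        raise ValueError(f"cannot migrate schema_version {version!r}")
    return payload
```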
**Non-scope:**
- A general-purpose migration framework. A function that dispatches
on an integer is sufficient until we have three versions.
- In-place migration (writing back the upgraded file). Migrations
should run on read; write-back is a separate operator decision.
---
## Phase 1: Clinical validation study (Paper C)
Phase 1 is *Paper C itself* — the clinical-methods paper this project
exists to produce. The content belongs in the paper, in `RESEARCH.md`,
and in the analysis-config YAMLs under `examples/`, not here. This
section exists only to demarcate the phase and to capture the
engineering commitments that should (and should not) happen during it.
**Engineering posture during Phase 1:**
- **Phase 0 is frozen on entry.** Don't refactor the analyzer during
Phase 1; refactors invalidate earlier experiments. If a Phase 0
shortcoming surfaces during paper-writing, log it in `RESEARCH.md`
and revisit after submission.
- **Phase 2 work is welcome as background.** Writing a launchd plist,
wiring up Dependabot, tightening error-path tests — all of this is
ideal filler work during the experimental-design and writing
phases of Paper C. It consumes different energy than research work
does, and the tool is in better shape on submission day as a
result.
- **`RESEARCH.md` gets the bulk of the updates.** Methods decisions,
reading-list expansions, reviewer-response notes all live there,
not here.
- **Do add engineering-side notes here** when a Paper C experiment
reveals a piece of missing tooling that's worth a Phase 2 item
(for example: "we needed batch-analysis across 200 trials and hit
this, so Phase 2 should include ..."). Phase 1 is the best
possible source of prioritization signal for what Phase 2 is
actually worth.
**Prerequisite outside this document:** a MoCap-data-access
conversation with Dr. Shu. Nothing in Phase 1 can start until that
conversation has resolved. `RESEARCH.md` §3 flags this as the
gating question for fine-tuning; it is equally the gating question
for validation.
---
## Phase 2: Coordinated open-source release + Paper A
Phase 2 is the release. Its content is exactly the items listed here
— the engineering work to take the Phase-0-plus-Phase-1 codebase to a
state where an outside researcher can pick it up, install it, run it,
verify its claims, and cite it. It runs concurrently with the tail
end of Phase 1 (see posture notes above) and culminates in a
coordinated drop: tag → PyPI → Pages → arXiv / JOSS submission for
Paper A → reference in Paper C's Code Availability section.
### Release definition
Before enumerating the remaining work, define what "released" means.
A release candidate should satisfy all of the following:
1. **Installable on a blank machine.** `pip install neuropose` or
`uv pip install neuropose` works on both Linux x86_64 and Apple
Silicon Mac, with no manual steps beyond Python 3.11.
2. **Runnable without the author in the room.** The `docs/` site is
published somewhere persistent (GitHub Pages, Cloudflare Pages),
the getting-started walkthrough actually works end-to-end, and
the MeTRAbs model downloads and verifies on first run.
3. **Verifiable by a reviewer.** CI runs on every push, covers both
Linux and macOS, and a PR from a stranger could be meaningfully
reviewed without access to the research Mac.
4. **Honest about its limits.** Every surface the release advertises
is either exercised in CI or clearly marked experimental. No
false promises in the README or CLI help text. (The `analyze`
stub that motivated this item pre-Phase-0 is now real per Phase
0's YAML pipeline, so "ship or remove" is no longer open.)
5. **Versioned.** A git tag exists, `__version__` matches, and
`CHANGELOG.md` has a real release section, not just `[Unreleased]`.
6. **Bundled.** Paper A (tech-stack writeup) and Paper C (clinical
validation) cite the release tag, and the release notes cite
them. The three artifacts arrive together; reviewers of either
paper can find and run the code.
Items below are the gaps between the end-of-Phase-0 state and that
definition.
### Apple Silicon CI matrix
**Status:** `RESEARCH.md` lists this as an open next step; no
`macos-14` entry in `.github/workflows/ci.yml`.
**Why it matters for release:** every claim of "Apple Silicon
support" is currently "by construction" — the TF 2.16+ floor ships
`darwin/arm64` wheels, the MeTRAbs SavedModel has zero custom ops, and
therefore it should work. It has not been empirically confirmed on
real hardware in an automated way. For a public release, we either
verify in CI or we stop claiming Mac support in the README.
**Scope:**
- Add a `macos-14` matrix entry to the `test` job (lint and typecheck
stay single-platform, they're platform-independent).
- Exclude `slow` markers on macOS so we don't pay the 2 GB model
download twice per run.
- Accept that the first green macOS run may require two or three
hotfixes — path case sensitivity, `multiprocessing` spawn vs fork,
shared library load order — and budget a day for that.
- Do **not** add a Metal runner. GitHub's `macos-14` runners don't
expose the GPU to TensorFlow in a useful way, and the `[metal]`
extra's numerical verification is a separate task that needs real
M-series silicon we control.
**Sketch:**
```yaml
test:
  strategy:
    fail-fast: false
    matrix:
      os: [ubuntu-latest, macos-14]
  runs-on: ${{ matrix.os }}
```
Everything else in the job stays the same; `uv` works identically on
both platforms.
### Mac hardware validation pass
**Status:** Unexercised. The Shu Lab research Mac (`100.64.15.110`) is
available; we have an rsync script but no cron job, no automated
smoke check, no numerical-divergence report against the Linux
baseline.
**Why it matters for release:** CI on GitHub's `macos-14` runners
validates that the wheels install and the tests pass on Apple
Silicon. It does not validate that the real MeTRAbs model loads, that
inference runs, or that `poses3d` on the Mac matches `poses3d` on
Linux within a sane tolerance. Those are different questions, and
answering them against a throwaway runner each time would be wasteful
and unreliable.
A minimum version of this check — "does `detect_poses` produce
output on the research Mac at all?" — should happen during Phase 0
regardless, because Paper C will likely run on the same hardware and
a silent numerical divergence there would invalidate the paper's
results. The scope below is the full, release-grade version.
**Scope:**
- Run `neuropose benchmark --compare-cpu` against a reference clip on
the research Mac. Capture the resulting `BenchmarkResult` JSON.
- Commit the JSON as `benchmarks/reference/mac_m3_ultra_cpu_v0_1.json`
(a tracked file, not gitignored — this is the reference numerics
we'll compare against going forward).
- Separately, run the `[metal]` path and diff. Record in
`RESEARCH.md` whether divergence is within the ~1e-2 mm budget the
research notes propose, or whether the Metal path is in the "use at
your own risk" column.
- Document the findings as a new section in `RESEARCH.md` ("Apple
Silicon verification, 2026-0X") and close the corresponding
open-question entry.
**Open question:** should the reference JSON become a test input
(slow-marked integration test that re-runs benchmark on a developer's
machine and asserts divergence from the committed reference), or just
documentation? The former catches regressions automatically at the
cost of a 2 GB model download in the slow job; the latter is cheaper
but easier to ignore.
### Retention and pruning
**Status:** `out/` and `failed/` grow forever. No retention config.
No `neuropose prune` command.
**Why it matters for release:** a research Mac running the daemon
unattended for months will fill its disk. The first support request
will be "the daemon just stopped working" and the answer will be "you
ran out of disk." We can solve this once now, or a hundred times
later.
**Scope:**
- Add a `retention_days: int | None = None` setting (default None =
disabled, preserving current behavior).
- When set, the daemon checks on each poll whether any job in
`out/` or `failed/` is older than the threshold and removes it. The
corresponding `status.json` entry transitions to a new `PRUNED`
state (keeping the audit trail) or is removed entirely (keeping the
status file small) — pick one and document.
- Ship a `neuropose prune [--older-than N] [--dry-run]` one-shot
command for operators who want manual control.
- Document in `docs/deployment.md` with a recommended default (30
days feels right for benchmark/iteration workflows; clinical
deployments would be legal-driven and much longer).
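A sketch of the per-poll prune pass, assuming the tombstone answer to
the open question below; names are illustrative:
```python
import shutil
import time
from pathlib import Path


def prune_old_jobs(data_dir: Path, retention_days: int) -> list[str]:
    """Remove out/ and failed/ jobs older than the threshold. Sketch only."""
    cutoff = time.time() - retention_days * 86_400
    pruned: list[str] = []
    for bucket in ("out", "failed"):
        for job_dir in sorted((data_dir / bucket).iterdir()):
            if job_dir.is_dir() and job_dir.stat().st_mtime < cutoff:
                shutil.rmtree(job_dir)
                pruned.append(job_dir.name)
    return pruned  # caller flips these status.json entries to PRUNED
```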
**Open question:** should pruned jobs' `status.json` entries be
preserved as tombstones (so a user asking "where did job X go?" can
see "pruned 2026-05-01") or removed entirely? Tombstones are more
user-friendly; removal keeps the status file bounded. Default to
tombstones since the status file bound is only a problem at a scale
the 0.1 release won't hit.
### neuropose doctor preflight
**Status:** Not implemented.
**Why it matters for release:** pydantic-settings validates the
*schema* of `Settings` (is `device` a valid string, is
`poll_interval_seconds` positive). It does not validate the
*environment* — is `data_dir` writable, is the lock file acquirable,
is `model_cache_dir` on the same filesystem as `data_dir` (so
`os.rename` works atomically), is the configured TF device actually
available. Each of those is a runtime failure mode that shows up with
an ugly traceback ten seconds after `neuropose watch` starts, and
every one is cheaply detectable at startup.
**Scope:**
- New subcommand `neuropose doctor` that runs a battery of
preflight checks and prints a pass/fail table.
- Checks to include: `data_dir` exists and is writable; lock file
acquirable (with clean release); all three subdirectories
(`in/out/failed`) writable; `model_cache_dir` writable and on the
same filesystem as `data_dir`; TF is importable; configured
`device` is in `tf.config.list_physical_devices()`;
`tensorflow-metal` either absent or installed with a version that
advertises support for the installed TF; XDG envvars are sane;
Python version matches `pyproject.toml` floor.
- Exit code 0 if all checks pass, 1 if any warning, 2 if any fatal
failure.
- The daemon's `run()` entry point calls the same underlying
preflight function before entering the poll loop, so
`watch`-without-doctor still gets the benefit.
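Two representative checks, sketched; the `CheckResult` shape and the
check names are assumptions:
```python
import os
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CheckResult:
    name: str
    ok: bool
    detail: str = ""


def check_data_dir_writable(data_dir: Path) -> CheckResult:
    try:
        with tempfile.NamedTemporaryFile(dir=data_dir):
            pass  # created and deleted: the directory is writable
        return CheckResult("data_dir writable", True)
    except OSError as exc:
        return CheckResult("data_dir writable", False, str(exc))


def check_same_filesystem(data_dir: Path, model_cache_dir: Path) -> CheckResult:
    # Same st_dev means os.rename between the two stays atomic.
    same = os.stat(data_dir).st_dev == os.stat(model_cache_dir).st_dev
    return CheckResult("model_cache_dir on data_dir filesystem", same,
                       "" if same else "os.rename across filesystems is not atomic")
```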
**Non-scope:**
- Do not check for network access to the MeTRAbs download host.
Network-dependent checks make CI flaky and don't match the offline
caching behavior of real operators.
### Process supervision artifacts
**Status:** `docs/deployment.md` documents a systemd user unit as
text in prose. No file in `scripts/` that a user can actually copy.
No macOS launchd plist at all.
**Why it matters for release:** copy-paste from a docs page into a
`.service` file works, but it's friction. An open-source project with
"here is the file, here is where it goes, here is the enable command"
ships deployments faster.
**Scope:**
- Ship `scripts/systemd/neuropose.service` as a file with `%h`
placeholders and a short install README.
- Ship `scripts/launchd/org.levineuwirth.neuropose.plist` as a file
with an install README. (Consider making the plist label match the
reverse-DNS of whoever is hosting — either the lab's or
`org.neuropose.daemon` for a vendor-neutral identity.)
- Optional: a `scripts/install_service.sh` that detects the platform
and runs the right install command. Probably not worth the
complexity; a five-line README section per platform is fine.
**Non-scope:**
- Do not write installers for init systems we do not personally run
(upstart, sysvinit, runit). If someone needs those, the systemd
unit gives them enough of a template.
### Structured logging option
**Status:** Everything logs to stderr via `logging.basicConfig`
with a human-readable formatter.
**Why it matters for release:** the current format is correct for
interactive use. For any consumer that wants to feed the daemon's
output into Loki, Splunk, Grafana, Datadog, or even `jq`-based
aggregation, JSON-per-line would eliminate a parsing step. This is
a near-free feature if added now and a disruptive formatting change
if added later. It is also a prerequisite for any Track 2
audit-logging work, so building it now keeps Track 2 options open at
near-zero cost.
**Scope:**
- Add a `--log-format={human,json}` global CLI option defaulting to
`human`.
- Implement the `json` variant as a formatter that emits
`{"ts": ..., "level": ..., "logger": ..., "message": ..., ...}` per
line with no log-line wrapping.
- Wire it through `_configure_logging()` so every subcommand benefits
identically.
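A sketch of the `json` formatter; the field set mirrors the bullet
above:
```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """One JSON object per log line; no wrapping, jq-friendly. Sketch only."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(payload)
```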
**Open question:** do we also want log correlation IDs per job?
That's a bigger change (pushing a context var through the
Interfacer's call stack) and probably Track 2 — skip for 0.1.
### Monitor authentication
**Status:** The monitor binds to `127.0.0.1:8765` by default. No
auth, no tokens. `--host 0.0.0.0` works but has a comment warning the
operator to think.
**Why it matters for release:** loopback-only is a reasonable
default, but the monitor is specifically marketed as the thing
collaborators can watch. "Collaborator" implies a browser somewhere
other than the daemon host. The "correct" answer (TLS, real auth) is
too expensive for 0.1; the "wrong but acceptable" answer (no auth, so
anyone who can reach the port sees everything) is what we have now.
There's a middle ground.
**Scope:**
- Add an optional `monitor_token: str | None = None` setting.
- When set, every request to `/` and `/status.json` must carry
`?token=<value>` in the query string or `X-Status-Token` in the
header. No token → 401.
- `neuropose serve` prints a URL including the token on startup, so
operators can copy-paste it. If `monitor_token` is unset, behavior
is unchanged.
- `--host 0.0.0.0` emits a stderr warning if `monitor_token` is unset
— don't block it, just flag it.
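The check itself is a few lines. Sketched framework-agnostically
below, with `hmac.compare_digest` so the token cannot be guessed
byte-by-byte through response timing:
```python
import hmac


def is_authorized(query_token: str | None, header_token: str | None,
                  monitor_token: str | None) -> bool:
    """True if the request may see / and /status.json. Sketch only."""
    if monitor_token is None:
        return True  # auth disabled: current behavior, unchanged
    for supplied in (query_token, header_token):
        if supplied is not None and hmac.compare_digest(supplied, monitor_token):
            return True
    return False  # caller responds 401
```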
**Non-scope:**
- TLS. Use a reverse proxy (Caddy, nginx, `ssh -L`) for any
internet-facing exposure. The monitor is not the right place to
terminate TLS.
- Multi-user auth, session cookies, anything with a database. That's
Track 2.
### Docker GPU image
**Status:** `Dockerfile` exists (CPU-only). `Dockerfile.gpu`
mentioned in CHANGELOG as planned.
**Why it matters for release:** a single-file CUDA deployment story
reduces "can I run this on our lab server?" from a 45-minute dance
with conda and CUDA versions to one `docker run`. For Linux GPU
users this is the friction difference between trying the project and
bouncing.
**Scope:**
- Write `Dockerfile.gpu` on top of `nvidia/cuda:12.x-runtime-ubuntu22.04`
(pick the version TF 2.18 actually supports — check the
`tensorflow-gpu` compat matrix, not just "latest").
- Multi-stage: build stage has `uv` and builds the venv; final stage
just copies the venv and sets entrypoints.
- Add a `docker-build.yml` CI workflow that builds both images on
every push to main and publishes as `ghcr.io/neuwirth/neuropose:cpu`
and `:gpu` (or wherever the project ends up hosted).
- Document in `docs/deployment.md` with a `docker run --gpus all`
example.
**Non-scope:**
- A `tensorflow-metal` Docker image. Mac can't virtualize Metal, so
there's no point.
### Dependency freshness automation
**Status:** No Dependabot, no Renovate. Everything floats until
somebody notices. The recent TF cap tightening (`<2.19`) was caught
manually because a user happened to ask; a scheduled bot would have
flagged it weeks earlier.
**Why it matters for release:** security CVEs on transitive
dependencies land every few weeks. Without automation, they get
discovered by a downstream user trying to install into an audited
environment. With automation, they become a PR you either merge or
explicitly decline.
**Scope:**
- Add `.github/dependabot.yml` with groups: `python-prod`,
`python-dev`, `github-actions`. Weekly schedule. Ignore `tensorflow`
updates until manually cleared (the `tensorflow-metal` constraint
means auto-bumping TF is destructive).
- Alternative: Renovate via `renovate.json`. Renovate has better
grouping and scheduling, Dependabot is simpler and needs no setup
on GitHub. For an open-source Brown-lab project, Dependabot is
enough.
- Add `uv lock --upgrade-package <name>` to the dev playbook in
`docs/development.md` so PR authors know how to re-lock.
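A sketch of the config; the group names follow the bullet above, and
the `pip` ecosystem entry is assumed to cover the
`pyproject.toml`-managed dependencies:
```yaml
# .github/dependabot.yml (sketch)
version: 2
updates:
  - package-ecosystem: pip
    directory: /
    schedule:
      interval: weekly
    groups:
      python-prod:
        dependency-type: production
      python-dev:
        dependency-type: development
    ignore:
      - dependency-name: tensorflow  # bumped manually; metal constraint
  - package-ecosystem: github-actions
    directory: /
    schedule:
      interval: weekly
```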
### Release workflow
**Status:** `[project.scripts]` is wired for `pip install`, but no
tag-triggered publishing pipeline. `.github/workflows/docs.yml`
uploads the built docs as a 14-day artifact, not to Pages.
**Why it matters for release:** "release" without a repeatable
publishing flow is a synonym for "one-off person runs hatch build on
their laptop at 11pm before the paper deadline." That is not a
release.
**Scope:**
- `.github/workflows/release.yml` triggered on version tags
(`v[0-9]+.[0-9]+.[0-9]+`). Steps: check version matches
`__version__`; build with `hatch build`; publish to PyPI via
trusted publisher (no long-lived token); create GitHub release with
changelog excerpt.
- Flip `docs.yml` to deploy the `site/` output to GitHub Pages on
every push to `main` once the repo is public. Pin the Pages URL in
the README and in `site_url` in `mkdocs.yml` (already points at
`levineuwirth.github.io`, but verify).
- Sign tags with GPG; document the key fingerprint in `SECURITY.md`
(which does not yet exist; create it).
- Consider wiring sigstore signing at the same time (see the Track 2
supply-chain section). Free after the initial setup, and it buys
everything Track 2 would want without committing to the rest of
that track.
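A skeleton of the tag-triggered workflow. The trusted-publisher
action is real; the version-check step and the
`src/neuropose/__init__.py` path are illustrative assumptions:
```yaml
# .github/workflows/release.yml (sketch)
name: release
on:
  push:
    tags: ["v[0-9]+.[0-9]+.[0-9]+"]
jobs:
  publish:
    runs-on: ubuntu-latest
    permissions:
      id-token: write  # OIDC token for PyPI trusted publishing
      contents: write  # create the GitHub release
    steps:
      - uses: actions/checkout@v4
      - name: Verify tag matches __version__
        run: |
          grep -q "\"${GITHUB_REF_NAME#v}\"" src/neuropose/__init__.py
      - run: pipx run hatch build
      - uses: pypa/gh-action-pypi-publish@release/v1
```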
**Open question:** do we publish under `neuropose`, `brown-neuropose`,
or something else on PyPI? Whichever name, squat it before the paper
drops; waiting means risking name-squatter abuse.
### Error-path test coverage expansion
**Status:** Happy paths and a handful of input-validation errors
covered. Not covered: disk full mid-write, corrupt video mid-decode,
OOM during inference, fcntl.flock on NFS (no-op on some kernels),
truncated zip archives, permission denied on data_dir subdirectories.
**Why it matters for release:** shipping a tool where "happy path
works" is different from shipping a tool where "when it fails, it
fails predictably." For a clinical research pipeline where a crash
mid-job quarantines valuable recording data, fault tolerance is a
feature.
**Scope:**
- Systematic pass: for each module, write a `test_<module>_failure_modes.py`
enumerating the specific exception classes that can escape and the
corresponding test case that triggers each one. Use `pytest.raises`
with the exact expected exception class.
- Hardest cases use fixtures that monkeypatch system calls
(`os.write` raises OSError(ENOSPC), `cv2.VideoCapture.read` returns
`False, None` partway through, `fcntl.flock` raises OSError(EBADF)).
- Aim: every user-facing error message in the codebase has a test
that proves it's reachable.
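One such test, sketched end-to-end; `save_results` here is a
stand-in for whichever neuropose function owns the write path:
```python
import errno
import json
from pathlib import Path

import pytest


def save_results(path: Path, results: dict) -> None:
    """Stand-in for the real save path under test."""
    path.write_text(json.dumps(results))


def test_save_results_disk_full(monkeypatch, tmp_path):
    def enospc(self, *args, **kwargs):
        raise OSError(errno.ENOSPC, "No space left on device")

    # Simulate a full disk at the write primitive the code under test
    # uses; monkeypatch restores it after the test.
    monkeypatch.setattr(Path, "write_text", enospc)
    with pytest.raises(OSError) as excinfo:
        save_results(tmp_path / "results.json", {"ok": True})
    assert excinfo.value.errno == errno.ENOSPC
```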
**Non-scope:**
- Chaos-engineering frameworks. `monkeypatch` is enough.
- Covering unrecoverable errors like SIGKILL of the daemon mid-frame.
That's the recovery-on-startup test, which already exists.
---
## Track 2: Clinical platform (contingent)
Track 2 is everything beyond the open-source research tool:
multi-tenancy, audit logging, HTTP/API layer, clinician UI,
clinical-system integrations, the works. None of it is sequenced
with Phases 0-2; all of it is gated on specific triggers that don't
exist yet.
### Triggers to activate Track 2
Do not start Track 2 work until at least one of the following is
true:
1. **External demand.** Another clinical group has asked for a
deployment they can run independently. Not a casual "interesting
project" remark, but a specific ask with a specific cohort and a
specific timeline.
2. **Multi-site ambition.** The Shu Lab decides to run NeuroPose
across more than one site within Brown-affiliated clinical
systems, and the single-host assumption stops working.
3. **Funding mandate.** A grant or contract specifies outputs that
the Phase 0-1-2 deliverables cannot meet (e.g. "produce a
HIPAA-compliant deployment," "integrate with the EHR").
4. **Publication traction.** Papers A and C get engagement that
translates into demand for a hosted version. Clinical-methods
papers occasionally do. If enough unsolicited inquiries land,
Track 2 becomes worth the investment.
Before at least one of these triggers: everything below is
background thinking, not planned work. *Do not refactor Phase 0 or
Phase 2 code to make Track 2 easier.* Every such refactor is a bet
on a future that may not arrive.
### Multi-tenancy and identity
**What it would require:**
- A concept of "user" distinct from "OS user." Today `Settings.data_dir`
is one directory per OS user; multi-tenancy means one `data_dir`
serving many logical tenants with enforced isolation.
- Per-tenant namespacing in `in/`, `out/`, `failed/`, and
`status.json`. Cleanest is one subdirectory per tenant with the
same four-directory layout; the daemon's discovery logic becomes a
two-level scan.
- Authentication on the control plane. Passing tenant identity as a
command-line arg is fine for a research prototype; a real
deployment needs OAuth/OIDC or SAML with the institution's IdP
(Brown CAS, Epic Auth, whatever the target site uses).
- Authorization model: at minimum, "tenant A cannot see tenant B's
jobs." For clinical deployments, probably also role-based (clinician
/ PI / admin / auditor).
**Cheapest path forward if a trigger fires:** fork the data-directory
layout into `$data_dir/<tenant_id>/{in,out,failed,status.json}`,
teach the daemon to iterate tenants in its poll loop, add a
`--tenant` flag to the CLI. That's enough for an invitation-only
deployment where tenants are identified by opaque string and issued
out-of-band.
**Expensive path:** anything involving an identity provider. Don't
go there without a real operator committing to the deployment.
### Audit logging and compliance posture
**What it would require:**
- Append-only log of every data access, write, and configuration
change, with actor identity and timestamp. Separate from the
application log (which rotates).
- Logs streamed to a write-once sink (S3 with object-lock,
immutable journal) so a compromised host can't rewrite the
trail.
- Legal review: what exactly does HIPAA require of this tool? What
about institutional IRB? The answer will differ across sites, and
the project cannot prescribe it, but the *capability* to generate
the required logs needs to be built in.
- Retention policy wired to the audit log, not just application
state. Pruning job results is different from pruning audit records.
**Technical prerequisite:** structured logging from Phase 2 (which
is a cheap add and is scheduled anyway). Without JSON-per-line logs,
audit extraction is a grep-and-pray regex problem.
### HTTP/API layer
**What it would require:**
- Today the control plane is "write files to `in/`." For a
non-filesystem-native consumer (a hosted web UI, a batch scheduler,
a Jupyter kernel in a different container), an HTTP API is the
right abstraction.
- FastAPI or Litestar on top of the existing ingest/interfacer/io
modules. The daemon becomes a long-running process that serves
requests *and* processes the input directory; or the daemon stays
headless and the HTTP layer is a separate process talking via the
same filesystem contract.
- OpenAPI schema published as part of the release so client code can
be generated.
**Non-obvious pitfall:** the daemon's fcntl-based single-instance
lock assumes one writer. If the HTTP layer is a separate process, it
needs to go through the same ingest API, not directly into `in/`.
That's an easy discipline to establish if designed in from day one,
a painful refactor later.
**Cheap Phase 0/2 precaution:** keep `neuropose.ingest` and
`neuropose.interfacer` API-stable as Python modules. If a future
HTTP layer imports them, we don't want to break the import.
### Clinician-facing UI
**What it would require:**
- More than the `neuropose serve` dashboard: an actual web
application with clinician-facing views (patient list, session
list, session-level pose visualization, comparison against
reference motion, exportable reports).
- Probably React + TypeScript on the frontend, consuming the HTTP
API above. Backend-rendered templates would be faster to build but
a worse fit for the per-session interaction model clinicians
expect.
- WebGL or Three.js for 3D pose playback. The `neuropose.visualize`
module is a matplotlib-based still-frame tool; rebuilding it for
interactive 3D is a weeks-to-months project on its own.
- Accessibility: clinician environments include keyboard-only users,
users on institutional IE holdovers (yes, still), users with
screen readers. A research-grade UI ignores this; a clinical-grade
one cannot.
**Scope is enormous.** This is the single largest piece of Track 2
and would likely dwarf all other Track 2 work combined. Would not
start without dedicated frontend engineering effort.
### Horizontal scaling
**What it would require:**
- A message broker (Redis Streams, RabbitMQ, or NATS) in place of the
filesystem poll. Each job becomes a broker message; multiple
worker processes consume and process in parallel.
- Shared storage for inputs and outputs (S3, MinIO, NFS). The
"job_name is a directory" contract generalizes to "job_name is an
object prefix."
- Per-worker GPU affinity for the multi-GPU case; worker auto-sizing
based on queue depth.
- Distributed lock for the leader-only work (status file writes,
retention enforcement).
**Upgrade path that minimizes pain:** the current single-process
daemon is equivalent to the "one worker" case of a horizontal
deployment. If the job object in `neuropose.io` stays the source of
truth (not the filesystem layout), the transition is backend-swap,
not architectural surgery. Keep that option open by treating the
filesystem as an implementation detail of `Interfacer`, not a public
contract.
### Backup, replication, and data durability
**What it would require:**
- Outputs (`out/<job>/results.json`) currently live on one disk on
one host. For clinical data this is insufficient durability.
- Replication target: another host (hot standby), object storage
(warm archive), or both. The `out/` directory is the canonical
store; replicating it periodically is a scriptable cron job today.
- Proper replication: as writes happen, not as a cron. Either a
daemon-side hook that PUTs to S3 immediately after each
`save_job_results`, or a sidecar process watching the filesystem
with `inotify`/`fswatch`.
- Restore story: how do we restore `out/` from backup without
breaking `status.json` (which refers to job names by convention)?
Test this annually.
**Minimum viable backup for Phase 2:** add a `scripts/backup.sh`
that rsyncs `$data_dir/out/` to a configurable destination. Not a
feature; a paving-the-path-for-operators artifact.
### Clinical-system integrations
**What it would require:**
- **DICOM** if videos are stored as DICOM instances rather than
MP4. Clinical motion-analysis devices increasingly output DICOM
video; reading DICOM means `pydicom` + some decoding logic.
- **FHIR** for patient metadata. If NeuroPose is to accept a
patient ID and attach it to a session, that ID probably comes
from a FHIR Patient resource. Means speaking FHIR to the hospital's
FHIR endpoint (Epic, Cerner).
- **REDCap** integration for clinical-research cohorts (the Brown
ecosystem uses it heavily). An export script that pulls subject
metadata from a REDCap project and lays it into the ingest
directory is cheap and valuable.
**Order of likely need:** REDCap first (easy, valuable, Brown-local),
then DICOM (depends on what the recording device outputs), then
FHIR (only if we're pulling from an EHR, which we probably aren't
for research).
### Deterministic inference mode
**What it would require:**
- Phase 0's `Provenance` object already captures model SHA, TF
version, NumPy version, and a seed field. The missing piece for
strict reproducibility is forcing TensorFlow itself to behave
deterministically —
`tf.config.experimental.enable_op_determinism()` plus seeding all
of `random`, `numpy.random`, and `tf.random`.
- A `deterministic: bool = False` setting on `Settings` that flips
the above. Default off, because deterministic mode costs a
meaningful fraction of throughput on GPUs and isn't free on CPUs
either. Clinical deployments would turn it on; benchmark runs
would turn it off.
- A `Provenance.deterministic` boolean field is already in the Phase
0 scope; this item closes the loop by actually making that
boolean mean something.
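The switch itself is small. The TensorFlow and NumPy calls below are
real APIs; the `Settings` wiring is the part this item adds:
```python
import random

import numpy as np
import tensorflow as tf


def enable_determinism(seed: int) -> None:
    """Seed every RNG in play and force deterministic TF op kernels."""
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    tf.config.experimental.enable_op_determinism()
```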
**Cheap Phase 2 precaution:** wire the setting in Phase 2 even if we
don't flip it on by default. Future Track 2 deployments can flip it
without a code change.
### Observability and SLOs
**What it would require:**
- Prometheus metrics endpoint (separate port from the monitor, no
auth needed on metrics, loopback or behind a scraper only).
- Counters: jobs_processed, jobs_failed, frames_processed, bytes_read,
bytes_written. Histograms: per-frame latency, per-job latency,
per-video latency. Gauges: queue depth, active job count.
- Tracing: OpenTelemetry instrumentation on job_process,
detect_poses, save_job_results. Again, the interesting spans are
the long ones, so trace-sampling at 100% is usually fine until
throughput matters.
- Defined SLOs: "99% of jobs complete within 10× video duration,"
"95% of monitor requests return in under 100 ms," etc.
SLO definitions go into a `docs/slos.md`; burn-rate alerting is
the operational half.
**Order-of-magnitude** dependency: none of this is useful without
Track 2 demand. A single-user research Mac doesn't have SLOs.
### Supply-chain attestation and signed releases
**What it would require:**
- SBOM generation on every release (CycloneDX or SPDX format,
attached to the GitHub release and published alongside the wheel).
- Signed releases: sigstore / cosign signatures on the wheel, the
Docker images, and the source tarball. GitHub's OIDC +
sigstore makes this a ten-line workflow once. For a clinical tool,
a reviewer being able to verify "this wheel is the one GitHub
Actions produced from this commit" is non-negotiable.
- Reproducible builds: same source → same wheel hash. Python wheels
are usually reproducible with `SOURCE_DATE_EPOCH` set and `.pyc`
exclusion; document the exact command.
- Provenance attestations (SLSA level 2 or 3) for the CI pipeline.
GitHub's `attestations/build-provenance` action does this.
**Cheapest Phase 2 precaution:** wire sigstore signing into the
release workflow when it's first built (see Phase 2 release workflow
section). Free after the initial setup.
### Deployment orchestration
**What it would require:**
- Kubernetes manifests (Helm chart, probably). Pod specs for the
daemon, the monitor, the HTTP API. Separate deployments so they
can scale independently.
- Terraform or Pulumi for the underlying infrastructure: GPU
node pool, object storage, IAM, TLS termination. Site-dependent;
Brown runs primarily on-prem with some AWS — the IaC would need
to target both.
- Secrets management: Vault, AWS Secrets Manager, or K8s
Secrets + External Secrets Operator. The monitor token, the
broker credentials, the object-storage keys all need to stop being
env vars in a `.service` file.
**Strong recommendation:** do not write any of this until there is
a specific deployment with specific operators. Generic K8s manifests
written without a target are a solution in search of a problem, and
they age fast.
---
## Decisions to not prematurely foreclose
A short list of choices we should avoid making in Phase 0 or Phase 2
that would make Track 2 more expensive later:
1. **Keep `neuropose.ingest` and `neuropose.interfacer` API-stable
as Python modules.** A future HTTP layer should be able to import
them. Avoid adding `@staticmethod` decorators that hide internal
state; avoid coupling to global config.
2. **Keep the filesystem layout reversible.** Anything in
`$data_dir` that is not a user artifact should be treated as
internal. If Track 2 wants to replace the filesystem with an
object store, the daemon's only file I/O should be via
`neuropose.io` helpers — no raw opens scattered through the code.
3. **Keep `VideoPredictions.provenance` extensible.** The Phase 0
`Provenance` model should be a pydantic model so fields can be
added backward-compatibly. Don't pack provenance into free-form
strings or nested dicts that require bespoke parsing.
4. **Keep the CLI subcommands orthogonal.** Do not add subcommands
that wrap multiple other subcommands for convenience; that
creates API shape we'd regret if the right composition layer
later is HTTP, not shell.
5. **Keep model loading behind `neuropose._model`.** A future
self-hosted model registry, signed-artifact verification, or
multi-model switching should be a change in one file, not a
refactor across the estimator.
6. **Keep `Settings` the single source of truth.** No `os.environ`
reads outside pydantic-settings; no sprinkled `Path.home()`
calls. Track 2 almost certainly overrides configuration from
a secret store, and if that override has one place to hook in,
it's easy.
7. **Keep status-file schema owned by pydantic, not hand-written
JSON.** Track 2 multi-tenancy means indexing into the status
file by tenant; a pydantic model refactor is cheap, a
hand-written dict refactor is not.
8. **Keep the `AnalysisConfig` shape additive.** The Phase 0 YAML
schema will evolve through Phase 1 as Paper C's experiments
surface needs. Additions are free (new optional fields);
renames and removals invalidate prior experiments. Pydantic's
`extra="forbid"` catches typos at parse time while still
allowing additive extension.
These are cheap-now / expensive-later items. Every other Track 2
decision can wait for a real trigger.