# NeuroPose Technical Ideation Notes
A living engineering roadmap, parallel to `RESEARCH.md`. Where
`RESEARCH.md` captures open methodological questions (DTW, skeleton
choice, hosting the model), this document captures open *engineering*
questions — release readiness, operability, scaling — and the paths
they could take.
This is **not** user-facing documentation. Items here are *candidates*
for future work, and inclusion does not imply commitment.
## How to use this document
- Add a section when you start thinking about a new area of technical
investment.
- Each section should end with a **Scope**, **Sketch**, or **Open
questions** block so it's obvious to a future you (or a new
contributor) what the concrete next move would be.
- When an item in here is decided and implemented, move it to the
relevant place in `docs/` or in the code itself, and leave a short
pointer behind (*See `docs/deployment.md` for the resolved design.*).
- The audience is anyone maintaining the codebase — Levi, David,
Praneeth, Dr. Shu, and whoever comes after us. Assume competence in
Python and systems work; don't assume familiarity with our specific
tooling choices.
## Three phases, then a contingent track
There are four distinct technical objectives, ordered by timeline and
by what each enables next. The sequencing is deliberate: each phase
unblocks the next, and doing them in any other order either publishes
Paper C on top of a pipeline its own design notes disavow, or delays
the open-source release past the window where the accompanying paper
is still salient.
1. **Phase 0 — C-enabling pipeline work.** A targeted subset of
engineering work that has to land *before* Paper C can start. The
DTW defaults shipped in 0.1 are explicitly a "mechanical port, not
a methodological choice" (see `RESEARCH.md` §1); running the
clinical validation study on them would mean publishing results
from a pipeline the accompanying design notes explicitly criticize.
Phase 0 fixes the analyzer's methodological foundations (Procrustes
preprocessing, cycle segmentation, joint-angle DTW representation),
locks in the reproducibility surface (`Provenance` subobject,
YAML-configurable analysis pipeline), and sets up schema migration
so data generated during Phase 1 survives the long write-up.
**Near-term, well-scoped, weeks of work.**
2. **Phase 1 — Paper C: clinical validation study.** The planned
clinical-methods paper: cycle-aware joint-angle DTW for clinical
gait similarity, validated against MoCap ground truth and/or
clinician ratings. Gated on MoCap data access via Dr. Shu. This is
research work, not engineering work — this document describes the
engineering scaffolding *around* it, not the paper itself. Phase 2
work can happen in the background during this phase as ideal filler
for research-burnout cycles. **Months; timeline driven by data
access and experimental design.**
3. **Phase 2 — Coordinated open-source release + Paper A.** The
engineering-paper companion (A) describing the tech stack, plus
the tagged 0.1 release: PyPI publication, docs deployment, Docker
images, CI matrix, supervision artifacts, doctor preflight, all
the operational items that make the tool credible to external
users. Timed to arrive *with or slightly before* Paper C's
submission, producing a paper-plus-tool bundle that reviewers can
actually run. **Weeks of work, timing driven by Paper C's
submission window.**
4. **Track 2 — Clinical platform (contingent).** Everything beyond
the open-source research tool — multi-tenancy, audit logging,
HTTP/API layer, clinician UI, clinical-system integrations. Not
sequenced; activates only if specific triggers fire (external
demand, multi-site ambition, funding mandate, publication
traction). Most of this is background thinking, not planned work.
The value of keeping it in this document is so that Phase 0 and
Phase 2 decisions don't accidentally foreclose Track 2 options.
Phases 0 → 1 → 2 form a near-term sequence that culminates in a
paper-plus-release bundle. Track 2 sits outside that sequence and
does not gate any of it.
## Contents
- [Phase 0: C-enabling pipeline work](#phase-0-c-enabling-pipeline-work)
- [Procrustes preprocessing](#procrustes-preprocessing)
- [Gait cycle segmentation](#gait-cycle-segmentation)
- [Joint-angle DTW representation](#joint-angle-dtw-representation)
- [Provenance subobject](#provenance-subobject)
- [YAML-configurable analysis pipeline](#yaml-configurable-analysis-pipeline)
- [Schema migration for VideoPredictions](#schema-migration-for-videopredictions)
- [Phase 1: Clinical validation study (Paper C)](#phase-1-clinical-validation-study-paper-c)
- [Phase 2: Coordinated open-source release + Paper A](#phase-2-coordinated-open-source-release--paper-a)
- [Release definition](#release-definition)
- [Apple Silicon CI matrix](#apple-silicon-ci-matrix)
- [Mac hardware validation pass](#mac-hardware-validation-pass)
- [Retention and pruning](#retention-and-pruning)
- [neuropose doctor preflight](#neuropose-doctor-preflight)
- [Process supervision artifacts](#process-supervision-artifacts)
- [Structured logging option](#structured-logging-option)
- [Monitor authentication](#monitor-authentication)
- [Docker GPU image](#docker-gpu-image)
- [Dependency freshness automation](#dependency-freshness-automation)
- [Release workflow](#release-workflow)
- [Error-path test coverage expansion](#error-path-test-coverage-expansion)
- [Track 2: Clinical platform (contingent)](#track-2-clinical-platform-contingent)
- [Triggers to activate Track 2](#triggers-to-activate-track-2)
- [Multi-tenancy and identity](#multi-tenancy-and-identity)
- [Audit logging and compliance posture](#audit-logging-and-compliance-posture)
- [HTTP/API layer](#httpapi-layer)
- [Clinician-facing UI](#clinician-facing-ui)
- [Horizontal scaling](#horizontal-scaling)
- [Backup, replication, and data durability](#backup-replication-and-data-durability)
- [Clinical-system integrations](#clinical-system-integrations)
- [Deterministic inference mode](#deterministic-inference-mode)
- [Observability and SLOs](#observability-and-slos)
- [Supply-chain attestation and signed releases](#supply-chain-attestation-and-signed-releases)
- [Deployment orchestration](#deployment-orchestration)
- [Decisions to not prematurely foreclose](#decisions-to-not-prematurely-foreclose)
---
## Phase 0: C-enabling pipeline work
The six items below are prerequisites for Paper C. Until they are
landed, every analysis C would produce would be running on defaults
that `RESEARCH.md` §1 explicitly flags as provisional. Ship these
first, in any order that suits the implementer's cadence, and the
rest of the project can pick up with confidence that Phase 1 results
are trustworthy.
### Procrustes preprocessing
**Status:** Not implemented. `neuropose.analyzer.features` ships
`extract_joint_angles` and feature-statistics helpers; no alignment
step exists between pose sequences.
**Why it matters for Paper C:** without alignment, DTW distance is
translation- and orientation-dependent. Two recordings of the same
subject from different camera positions produce different distances,
which is almost never what a clinician wants. Paper C's methods
section would need to apologize for this in print; cheaper to fix the
method than to defend it.
**Scope:**
- Add `procrustes_align(a: np.ndarray, b: np.ndarray, *, mode:
Literal["per_frame", "per_sequence"]) -> tuple[np.ndarray,
np.ndarray, AlignmentDiagnostics]` to `neuropose.analyzer.features`.
Implements the Kabsch algorithm (closed-form optimal rigid
transform). Per-frame aligns each frame of A to the corresponding
frame of B independently; per-sequence computes one transform over
the whole sequence. Both are useful — per-frame for fine-grained
matching, per-sequence for preserving within-trial dynamics.
- Return aligned arrays plus an `AlignmentDiagnostics` dataclass with
the fitted rotation magnitude and translation magnitude so
downstream code can flag suspiciously large transforms (usually a
sign of upstream annotation error).
- Expose as an opt-in `align: Literal["none", "procrustes_per_frame",
"procrustes_per_sequence"] = "none"` parameter on every DTW entry
point in `neuropose.analyzer.dtw`. Default `none` preserves current
behavior; Paper C's pipeline sets it to `procrustes_per_sequence`.
- Unit tests: construct a known rotation + translation between two
synthetic skeletons, verify alignment recovers it to within
floating-point precision; verify alignment of a sequence with its
own translated copy produces zero residual.
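As a concreteness check on the scope above, a minimal sketch of the
per-sequence path. The Kabsch core is standard; the array shapes and
the `AlignmentDiagnostics` fields shown here are assumptions, not the
final API:
```python
# Sketch only. Assumes pose sequences are (frames, joints, 3) float
# arrays; the diagnostics fields are illustrative.
from dataclasses import dataclass

import numpy as np


@dataclass
class AlignmentDiagnostics:
    rotation_deg: float   # magnitude of the fitted rotation
    translation: float    # magnitude of the fitted translation


def kabsch(a: np.ndarray, b: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Closed-form rigid (R, t) minimizing ||(a @ R.T + t) - b||, rows = points."""
    ca, cb = a.mean(axis=0), b.mean(axis=0)
    h = (a - ca).T @ (b - cb)
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T  # force a proper rotation (no reflection)
    return r, cb - ca @ r.T


def procrustes_align_per_sequence(a: np.ndarray, b: np.ndarray):
    """Fit one rigid transform over all frames of a and apply it."""
    pts_a, pts_b = a.reshape(-1, 3), b.reshape(-1, 3)
    r, t = kabsch(pts_a, pts_b)
    aligned = (pts_a @ r.T + t).reshape(a.shape)
    angle = np.degrees(np.arccos(np.clip((np.trace(r) - 1) / 2, -1.0, 1.0)))
    return aligned, b, AlignmentDiagnostics(float(angle), float(np.linalg.norm(t)))
```
The per-frame mode is the same core applied frame by frame; the
`scale` flag from the open question below would divide out the
centered RMS norms before fitting.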
**Non-scope:**
- Non-rigid alignment (thin-plate splines, learned registration). Not
needed for skeleton-level comparison and would be a research
contribution on its own.
**Open question:** should alignment also include optional scaling
(scaled-Procrustes / full Procrustes)? For cross-subject comparison
it almost certainly should. Default to scale-preserving and add a
`scale: bool = False` flag; Paper C can flip it on for cross-subject
figures.
### Gait cycle segmentation
**Status:** `segment_by_peaks` in `neuropose.analyzer.segment`
performs generic valley-to-valley segmentation on a supplied 1D
signal. There is no gait-specific wrapper that knows to look at the
heel's vertical coordinate.
**Why it matters for Paper C:** clinical gait analysis wants to
compare *the 4th heel-strike of trial A* to *the 4th heel-strike of
trial B*, not *frame 120 of A vs frame 120 of B*. Per-cycle DTW is
the standard approach in the biomechanics literature (Sadeghi et al.
2000 and descendants); running full-trial DTW on gait is a choice
reviewers of Paper C would correctly flag as methodologically weak.
**Scope:**
- New `segment_gait_cycles(predictions: VideoPredictions, *, joint:
str = "rhee", axis: Literal["x", "y", "z"] = "y", min_cycle_seconds:
float = 0.4) -> Segmentation` in `neuropose.analyzer.segment`.
- Under the hood: extract the specified joint's coordinate along the
specified axis, apply `segment_by_peaks` with appropriate distance
and prominence thresholds (derived from `min_cycle_seconds` via
`predictions.metadata.fps`), return the resulting `Segmentation`
(the existing `neuropose.io.Segmentation` type) so downstream
tooling picks it up unchanged.
- Two-sided detection: run the same detection on the opposite heel
and return *both* per-side segmentations under named keys
(`left_heel_strikes`, `right_heel_strikes`). Clinical users will
want both.
- Allow the reference joint and axis to be configurable so trials
recorded with a different camera orientation (lateral vs frontal
vs oblique) can still be segmented without a code change.
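A standalone sketch of the detection core, assuming `scipy` is
available. The real implementation would route through the existing
`segment_by_peaks` rather than calling `find_peaks` directly, and the
prominence threshold here is a placeholder heuristic:
```python
# Sketch only: heel strikes as local minima of the heel's vertical
# coordinate.
import numpy as np
from scipy.signal import find_peaks


def detect_heel_strikes(heel_height: np.ndarray, fps: float,
                        min_cycle_seconds: float = 0.4) -> np.ndarray:
    """Return frame indices of detected heel strikes."""
    min_distance = max(1, int(round(min_cycle_seconds * fps)))
    # Valleys of heel height are peaks of the negated signal; the
    # distance floor enforces the minimum plausible cycle length.
    strikes, _ = find_peaks(-heel_height, distance=min_distance,
                            prominence=0.25 * float(np.std(heel_height)))
    return strikes


# Consecutive strikes bound one gait cycle:
# cycles = list(zip(strikes[:-1], strikes[1:]))
```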
**Non-scope:**
- HMM-based cycle detection, learned cycle detectors. Peak detection
on vertical coordinate is standard, well-understood, and the
method the biomechanics literature expects to see.
- Handling pathological gaits where heel-strikes are absent
(shuffling, walker-assisted). The function should degrade
gracefully (return a `Segmentation` with an empty list, not raise),
and Paper C's data-quality filtering handles the rest.
**Open question:** should the function also emit a "confidence" per
cycle (prominence of the detected peak, regularity of spacing) that
Paper C can use to filter out low-quality detections? Cheap to add,
useful downstream. Recommend yes.
### Joint-angle DTW representation
**Status:** `dtw_all`, `dtw_per_joint`, and `dtw_relation` operate on
raw 3D coordinates or joint-pair displacements. `extract_joint_angles`
produces per-frame angle sequences but is not wired as a DTW input.
**Why it matters for Paper C:** angle-space DTW is translation- and
rotation-invariant by construction, scale-invariant if normalized,
and directly interpretable in clinical terms ("knee flexion angle
during swing phase"). Paper C's headline figures almost certainly
use angle-space distances; raw coordinates would draw the obvious
reviewer question of why we aren't comparing the thing clinicians
actually measure.
**Scope:**
- Add `representation: Literal["coords", "angles", "relation"] =
"coords"` to every DTW entry point. The `coords` default preserves
existing behavior; `angles` runs `extract_joint_angles` on each
input first; `relation` is the existing `dtw_relation` path
expressed as a representation choice rather than a separate
function (leaving the `dtw_relation` name as a convenience wrapper
if preferred).
- Degenerate-vector handling: `extract_joint_angles` returns NaN for
degenerate (zero-length) vectors. The DTW path needs to decide how
to handle NaN — skip-and-interpolate, drop, or propagate to the
distance. Propagation is safest (makes the problem visible);
interpolation is what clinical users probably want day-to-day.
Default to propagation and expose `nan_policy: Literal["propagate",
"interpolate", "drop"]` for experimentation.
- Tests: synthetic pair with known angular difference, assert DTW in
angle-space recovers it independent of global rotation applied to
the input.
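A sketch of the `nan_policy` handling above, applied to a
`(frames, n_angles)` angle sequence before it reaches the DTW
backend; the function name is illustrative:
```python
import numpy as np


def apply_nan_policy(angles: np.ndarray, nan_policy: str = "propagate") -> np.ndarray:
    """Pre-DTW NaN handling for a (frames, n_angles) sequence. Sketch only."""
    if nan_policy == "propagate":
        return angles  # NaNs flow into the distance and surface loudly
    if nan_policy == "drop":
        return angles[~np.isnan(angles).any(axis=1)]  # shortens the sequence
    if nan_policy == "interpolate":
        out, frames = angles.copy(), np.arange(len(angles))
        for j in range(out.shape[1]):
            bad = np.isnan(out[:, j])
            if bad.any() and not bad.all():
                out[bad, j] = np.interp(frames[bad], frames[~bad], out[~bad, j])
        return out
    raise ValueError(f"unknown nan_policy: {nan_policy!r}")
```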
**Non-scope:**
- Quaternion or SO(3) rotation-space DTW. Interesting but requires a
rotation parameterization the current skeleton output does not
produce.
- Mixed-representation (position + angle concatenated, learned
embeddings). These are experiments Paper C might run; they don't
belong in Phase 0 infrastructure.
### Provenance subobject
**Status:** `PerformanceMetrics` captures `tensorflow_version`,
`active_device`, and `tensorflow_metal_active`. Model SHA is not
computed or propagated. `numpy_version` and `neuropose_version` are
not recorded. No first-class `Provenance` object.
**Why it matters for Paper C:** reproducibility is the first
question a reviewer asks of a clinical-methods paper. The answer
needs to be "same model artifact, same pipeline config, same
versions, same seeds" — and all four need to be recorded on every
`results.json` that underlies a paper figure. Not having this means
either manually tracking it in a lab notebook (fragile, won't
survive personnel turnover) or running every experiment through a
pinned Docker image (expensive, doesn't capture runtime
non-determinism). The subobject is the cheap right answer.
**Scope:**
- New `Provenance` pydantic model in `neuropose.io` with fields:
`model_sha256: str`, `model_filename: str`, `tensorflow_version:
str`, `tensorflow_metal_version: str | None`,
`numpy_version: str`, `neuropose_version: str`, `python_version:
str`, `seed: int | None`, `deterministic: bool`, `analysis_config:
dict | None` (the YAML of the run if the pipeline was invoked via
`neuropose analyze --config`).
- Optional `provenance: Provenance | None = None` field on
`VideoPredictions`, `JobResults`, and `BenchmarkResult`. None-valued
on legacy files (enabled by schema migration — see below), populated
on every new write.
- `_model.py` hashes the downloaded tarball on first load (after the
existing SHA verification — the two checks use the same hash so
compute is amortized) and exposes the hash via a
`get_model_sha256()` method on the `Estimator`. `Interfacer._run_job_inner`
constructs the `Provenance` and attaches it to the output.
- Unit test: serialize → JSON → deserialize round-trip identity;
assert `model_sha256` matches the SHA recorded in
`neuropose._model`.
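The field list above, transcribed into pydantic as a sketch; the
defaults are assumptions, and the real model may grow validators:
```python
from pydantic import BaseModel, ConfigDict


class Provenance(BaseModel):
    # pydantic v2 reserves the "model_" prefix by default; relax it so
    # model_sha256 / model_filename are allowed as field names.
    model_config = ConfigDict(protected_namespaces=())

    model_sha256: str
    model_filename: str
    tensorflow_version: str
    tensorflow_metal_version: str | None = None
    numpy_version: str
    neuropose_version: str
    python_version: str
    seed: int | None = None
    deterministic: bool = False
    analysis_config: dict | None = None
```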
**Non-scope:**
- Cryptographic signatures on results.json. That's Phase 2 (sigstore
on release artifacts) or Track 2 (per-output signing) territory,
not Phase 0.
- Provenance on arbitrary intermediate products (numpy arrays, DTW
distance matrices). Top-level JSONs cover Paper C's needs; richer
intermediates can inherit from a hand-off if needed.
**Open question:** does Paper C need *per-frame* provenance (which
frame was processed with which configuration) or just per-job
provenance? Per-job is enough for reproducibility; per-frame is only
useful if we want to mix configurations within a single job, which
has no current use case.
### YAML-configurable analysis pipeline
**Status:** `neuropose.cli`'s `analyze` subcommand is a stub that
raises `NotImplementedError`. Analysis operations are called
individually from Python, or via CLI flags on `segment` and
`benchmark`. No unified representation of "a complete analysis run."
**Why it matters for Paper C:** the paper will run many experimental
configurations — alignment on/off, per-frame vs per-sequence, raw
coordinates vs joint angles, full-trial vs cycle-segmented DTW,
various distance metrics. Each experiment should be reproducible
from a single file that can be version-controlled, diffed, attached
to the `Provenance` object, and cited in the paper. A Python script
full of kwargs is the alternative, and it's exactly the alternative
the open-source community collectively decided against ten years ago.
This item also resolves the "`neuropose analyze`: ship or remove"
question that was previously open: we are shipping `analyze`,
specifically in a YAML-driven form. The stub that currently exists
becomes the real command in Phase 0.
**Scope:**
- `AnalysisConfig` pydantic model in `neuropose.analyzer` capturing
the full pipeline: input source (predictions file path),
preprocessing (`align`, `normalize`, `segment`), per-segment
analysis (DTW backend, representation, distance function, extra
kwargs), output (figures, statistics, distance matrices).
- Parseable from YAML via pydantic; validated on parse so typos in
field names fail early with a clear error.
- `neuropose analyze --config experiment.yaml [--output
results_042.json]` runs the pipeline end-to-end. The config YAML
is serialized into the resulting `Provenance.analysis_config`, so
the output file is self-describing.
- Ship three or four *example* configs under `examples/analysis/`
that exercise the full matrix of alignment × representation ×
segmentation choices Paper C will care about. Double as integration
tests.
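A hypothetical example config to make the shape concrete; every
field name below is provisional until `AnalysisConfig` lands:
```yaml
# examples/analysis/cycle_angles.yaml (hypothetical)
input:
  predictions: out/trial_042/results.json
preprocessing:
  align: procrustes_per_sequence
  segment:
    method: gait_cycles
    joint: rhee
    axis: y
    min_cycle_seconds: 0.4
analysis:
  backend: dtw_per_joint
  representation: angles
  nan_policy: propagate
output:
  distance_matrix: true
  statistics: true
  figures: [per_cycle_heatmap]
```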
**Non-scope:**
- A DAG / workflow engine (Snakemake, Nextflow). A flat config is
enough for Paper C's needs; reach for a DAG tool only when
experiments have genuine inter-stage dependencies, which analysis
of a single video does not.
- Parallel sweep execution. Run multiple configs via a shell loop
for now (`for cfg in examples/analysis/*.yaml; do neuropose
analyze --config "$cfg" --output "out/$(basename "$cfg" .yaml).json"; done`).
A real sweep orchestrator is Track 2.
**Open question:** should there be a `neuropose analyze compare
<config_a.yaml> <config_b.yaml>` subcommand that runs both and
emits a diff figure? Useful for Paper C but not a gating feature —
post-Phase-0 addition if the need is clear.
### Schema migration for VideoPredictions
**Status:** `VideoPredictions` gained `segmentations: dict[str,
Segmentation] = Field(default_factory=dict)` during recent work. Old
JSON files without the field still load (pydantic default-factories
back-fill), but this is accidental rather than designed-in.
**Why it matters for Paper C:** Paper C will produce analysis results
over the course of 6-12 months. During that window, Phase 0 work
itself will evolve — the `Provenance` object will gain fields, the
`AnalysisConfig` shape will stabilize, maybe the `Segmentation` schema
will extend. Without migration support, every schema change would
invalidate some portion of Paper C's already-generated data, forcing
either a freeze (drops velocity) or a full re-run (wastes compute).
Migration now is the cheap fix.
**Scope:**
- Add a `schema_version: int = 1` field to `VideoPredictions`,
`JobResults`, and `BenchmarkResult` (the three load-anywhere
top-level schemas).
- Write `migrate_video_predictions(payload: dict) -> dict` that
takes a raw JSON-loaded dict, dispatches on `schema_version`, and
returns a dict conformant with the current version. Default to 1
when missing (existing files).
- Wire it into `load_video_predictions()` so the migration runs
before pydantic validation. Log at INFO on migration so users see
when files are being upgraded.
- When writing, always write the current version.
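A sketch of the dispatch-on-integer shape; the v1-to-v2 body is
illustrative, since no v2 exists yet:
```python
CURRENT_SCHEMA_VERSION = 2  # illustrative; today's schemas are all v1


def migrate_video_predictions(payload: dict) -> dict:
    """Upgrade a raw JSON-loaded dict to the current schema version."""
    version = payload.get("schema_version", 1)  # missing field => v1
    if version == 1:
        # Example v1 -> v2 step: back-fill a field added after v1.
        payload.setdefault("provenance", None)
        payload["schema_version"] = version = 2
    if version != CURRENT_SCHEMA_VERSION:
        raise ValueError(f"cannot migrate schema_version {version!r}")
    return payload
```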
**Non-scope:**
- A general-purpose migration framework. A function that dispatches
on an integer is sufficient until we have three versions.
- In-place migration (writing back the upgraded file). Migrations
should run on read; write-back is a separate operator decision.
---
## Phase 1: Clinical validation study (Paper C)
Phase 1 is *Paper C itself* — the clinical-methods paper this project
exists to produce. The content belongs in the paper, in `RESEARCH.md`,
and in the analysis-config YAMLs under `examples/`, not here. This
section exists only to demarcate the phase and to capture the
engineering commitments that should (and should not) happen during it.
**Engineering posture during Phase 1:**
- **Phase 0 is frozen on entry.** Don't refactor the analyzer during
Phase 1; refactors invalidate earlier experiments. If a Phase 0
shortcoming surfaces during paper-writing, log it in `RESEARCH.md`
and revisit after submission.
- **Phase 2 work is welcome as background.** Writing a launchd plist,
wiring up Dependabot, tightening error-path tests — all of this is
ideal filler work during the experimental-design and writing
phases of Paper C. It consumes different energy than research work
does, and the tool is in better shape on submission day as a
result.
- **`RESEARCH.md` gets the bulk of the updates.** Methods decisions,
reading-list expansions, reviewer-response notes all live there,
not here.
- **Do add engineering-side notes here** when a Paper C experiment
reveals a piece of missing tooling that's worth a Phase 2 item
(for example: "we needed batch-analysis across 200 trials and hit
this, so Phase 2 should include ..."). Phase 1 is the best
possible source of prioritization signal for what Phase 2 is
actually worth.
**Prerequisite outside this document:** a MoCap-data-access
conversation with Dr. Shu. Nothing in Phase 1 can start until that
conversation has resolved. `RESEARCH.md` §3 flags this as the
gating question for fine-tuning; it is equally the gating question
for validation.
---
## Phase 2: Coordinated open-source release + Paper A
Phase 2 is the release. Its content is exactly the items listed here
— the engineering work to take the Phase-0-plus-Phase-1 codebase to a
state where an outside researcher can pick it up, install it, run it,
verify its claims, and cite it. It runs concurrently with the tail
end of Phase 1 (see posture notes above) and culminates in a
coordinated drop: tag → PyPI → Pages → arXiv / JOSS submission for
Paper A → reference in Paper C's Code Availability section.
### Release definition
Before enumerating the remaining work, define what "released" means.
A release candidate should satisfy all of the following:
1. **Installable on a blank machine.** `pip install neuropose` or
`uv pip install neuropose` works on both Linux x86_64 and Apple
Silicon Mac, with no manual steps beyond Python 3.11.
2. **Runnable without the author in the room.** The `docs/` site is
published somewhere persistent (GitHub Pages, Cloudflare Pages),
the getting-started walkthrough actually works end-to-end, and
the MeTRAbs model downloads and verifies on first run.
3. **Verifiable by a reviewer.** CI runs on every push, covers both
Linux and macOS, and a PR from a stranger could be meaningfully
reviewed without access to the research Mac.
4. **Honest about its limits.** Every surface the release advertises
is either exercised in CI or clearly marked experimental. No
false promises in the README or CLI help text. (The `analyze`
stub that motivated this item pre-Phase-0 is now real per Phase
0's YAML pipeline, so "ship or remove" is no longer open.)
5. **Versioned.** A git tag exists, `__version__` matches, and
`CHANGELOG.md` has a real release section, not just `[Unreleased]`.
6. **Bundled.** Paper A (tech-stack writeup) and Paper C (clinical
validation) cite the release tag, and the release notes cite
them. The three artifacts arrive together; reviewers of either
paper can find and run the code.
Items below are the gaps between the end-of-Phase-0 state and that
definition.
### Apple Silicon CI matrix
**Status:** `RESEARCH.md` lists this as an open next step; no
`macos-14` entry in `.github/workflows/ci.yml`.
**Why it matters for release:** every claim of "Apple Silicon
support" is currently "by construction" — the TF 2.16+ floor ships
`darwin/arm64` wheels, the MeTRAbs SavedModel has zero custom ops, and
therefore it should work. It has not been empirically confirmed on
real hardware in an automated way. For a public release, we either
verify in CI or we stop claiming Mac support in the README.
**Scope:**
- Add a `macos-14` matrix entry to the `test` job (lint and typecheck
stay single-platform, they're platform-independent).
- Exclude `slow` markers on macOS so we don't pay the 2 GB model
download twice per run.
- Accept that the first green macOS run may require two or three
hotfixes — path case sensitivity, `multiprocessing` spawn vs fork,
shared library load order — and budget a day for that.
- Do **not** add a Metal runner. GitHub's `macos-14` runners don't
expose the GPU to TensorFlow in a useful way, and the `[metal]`
extra's numerical verification is a separate task that needs real
M-series silicon we control.
**Sketch:**
```yaml
test:
  strategy:
    fail-fast: false
    matrix:
      os: [ubuntu-latest, macos-14]
  runs-on: ${{ matrix.os }}
```
Everything else in the job stays the same; `uv` works identically on
both platforms.
### Mac hardware validation pass
**Status:** Unexercised. The Shu Lab research Mac (`100.64.15.110`) is
available; we have an rsync script but no cron job, no automated
smoke check, no numerical-divergence report against the Linux
baseline.
**Why it matters for release:** CI on GitHub's `macos-14` runners
validates that the wheels install and the tests pass on Apple
Silicon. It does not validate that the real MeTRAbs model loads, that
inference runs, or that `poses3d` on the Mac matches `poses3d` on
Linux within a sane tolerance. Those are different questions, and
answering them against a throwaway runner each time would be wasteful
and unreliable.
A minimum version of this check — "does `detect_poses` produce
output on the research Mac at all?" — should happen during Phase 0
regardless, because Paper C will likely run on the same hardware and
a silent numerical divergence there would invalidate the paper's
results. The scope below is the full, release-grade version.
**Scope:**
- Run `neuropose benchmark --compare-cpu` against a reference clip on
the research Mac. Capture the resulting `BenchmarkResult` JSON.
- Commit the JSON as `benchmarks/reference/mac_m3_ultra_cpu_v0_1.json`
(a tracked file, not gitignored — this is the reference numerics
we'll compare against going forward).
- Separately, run the `[metal]` path and diff. Record in
`RESEARCH.md` whether divergence is within the ~1e-2 mm budget the
research notes propose, or whether the Metal path is in the "use at
your own risk" column.
- Document the findings as a new section in `RESEARCH.md` ("Apple
Silicon verification, 2026-0X") and close the corresponding
open-question entry.
**Open question:** should the reference JSON become a test input
(slow-marked integration test that re-runs benchmark on a developer's
machine and asserts divergence from the committed reference), or just
documentation? The former catches regressions automatically at the
cost of a 2 GB model download in the slow job; the latter is cheaper
but easier to ignore.
### Retention and pruning
**Status:** `out/` and `failed/` grow forever. No retention config.
No `neuropose prune` command.
**Why it matters for release:** a research Mac running the daemon
unattended for months will fill its disk. The first support request
will be "the daemon just stopped working" and the answer will be "you
ran out of disk." We can solve this once now, or a hundred times
later.
**Scope:**
- Add a `retention_days: int | None = None` setting (default None =
disabled, preserving current behavior).
- When set, the daemon checks on each poll whether any job in
`out/` or `failed/` is older than the threshold and removes it. The
corresponding `status.json` entry transitions to a new `PRUNED`
state (keeping the audit trail) or is removed entirely (keeping the
status file small) — pick one and document.
- Ship a `neuropose prune [--older-than N] [--dry-run]` one-shot
command for operators who want manual control.
- Document in `docs/deployment.md` with a recommended default (30
days feels right for benchmark/iteration workflows; clinical
deployments would be legal-driven and much longer).
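A sketch of the per-poll prune pass, assuming the tombstone answer to
the open question below; names are illustrative:
```python
import shutil
import time
from pathlib import Path


def prune_old_jobs(data_dir: Path, retention_days: int) -> list[str]:
    """Remove out/ and failed/ jobs older than the threshold. Sketch only."""
    cutoff = time.time() - retention_days * 86_400
    pruned: list[str] = []
    for bucket in ("out", "failed"):
        for job_dir in sorted((data_dir / bucket).iterdir()):
            if job_dir.is_dir() and job_dir.stat().st_mtime < cutoff:
                shutil.rmtree(job_dir)
                pruned.append(job_dir.name)
    return pruned  # caller flips these status.json entries to PRUNED
```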
**Open question:** should pruned jobs' `status.json` entries be
preserved as tombstones (so a user asking "where did job X go?" can
see "pruned 2026-05-01") or removed entirely? Tombstones are more
user-friendly; removal keeps the status file bounded. Default to
tombstones since the status file bound is only a problem at a scale
the 0.1 release won't hit.
### neuropose doctor preflight
**Status:** Not implemented.
**Why it matters for release:** pydantic-settings validates the
*schema* of `Settings` (is `device` a valid string, is
`poll_interval_seconds` positive). It does not validate the
*environment* — is `data_dir` writable, is the lock file acquirable,
is `model_cache_dir` on the same filesystem as `data_dir` (so
`os.rename` works atomically), is the configured TF device actually
available. Each of those is a runtime failure mode that shows up with
an ugly traceback ten seconds after `neuropose watch` starts, and
every one is cheaply detectable at startup.
**Scope:**
- New subcommand `neuropose doctor` that runs a battery of
preflight checks and prints a pass/fail table.
- Checks to include: `data_dir` exists and is writable; lock file
acquirable (with clean release); all three subdirectories
(`in/out/failed`) writable; `model_cache_dir` writable and on the
same filesystem as `data_dir`; TF is importable; configured
`device` is in `tf.config.list_physical_devices()`;
`tensorflow-metal` either absent or installed with a version that
advertises support for the installed TF; XDG envvars are sane;
Python version matches `pyproject.toml` floor.
- Exit code 0 if all checks pass, 1 if any warning, 2 if any fatal
failure.
- The daemon's `run()` entry point calls the same underlying
preflight function before entering the poll loop, so
`watch`-without-doctor still gets the benefit.
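Two representative checks, sketched; the `CheckResult` shape and the
check names are assumptions:
```python
import os
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CheckResult:
    name: str
    ok: bool
    detail: str = ""


def check_data_dir_writable(data_dir: Path) -> CheckResult:
    try:
        with tempfile.NamedTemporaryFile(dir=data_dir):
            pass  # created and deleted: the directory is writable
        return CheckResult("data_dir writable", True)
    except OSError as exc:
        return CheckResult("data_dir writable", False, str(exc))


def check_same_filesystem(data_dir: Path, model_cache_dir: Path) -> CheckResult:
    # Same st_dev means os.rename between the two stays atomic.
    same = os.stat(data_dir).st_dev == os.stat(model_cache_dir).st_dev
    return CheckResult("model_cache_dir on data_dir filesystem", same,
                       "" if same else "os.rename across filesystems is not atomic")
```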
**Non-scope:**
- Do not check for network access to the MeTRAbs download host.
Network-dependent checks make CI flaky and don't match the offline
caching behavior of real operators.
### Process supervision artifacts
**Status:** `docs/deployment.md` documents a systemd user unit as
text in prose. No file in `scripts/` that a user can actually copy.
No macOS launchd plist at all.
**Why it matters for release:** copy-paste from a docs page into a
`.service` file works, but it's friction. An open-source project with
"here is the file, here is where it goes, here is the enable command"
ships deployments faster.
**Scope:**
- Ship `scripts/systemd/neuropose.service` as a file with `%h`
placeholders and a short install README.
- Ship `scripts/launchd/org.levineuwirth.neuropose.plist` as a file
with an install README. (Consider making the plist label match the
reverse-DNS of whoever is hosting — either the lab's or
`org.neuropose.daemon` for a vendor-neutral identity.)
- Optional: a `scripts/install_service.sh` that detects the platform
and runs the right install command. Probably not worth the
complexity; a five-line README section per platform is fine.
**Non-scope:**
- Do not write installers for init systems we do not personally run
(upstart, sysvinit, runit). If someone needs those, the systemd
unit gives them enough of a template.
### Structured logging option
**Status:** Everything logs to stderr via `logging.basicConfig`
with a human-readable formatter.
**Why it matters for release:** the current format is correct for
interactive use. For any consumer that wants to feed the daemon's
output into Loki, Splunk, Grafana, Datadog, or even `jq`-based
aggregation, JSON-per-line would eliminate a parsing step. This is
a near-free feature if added now and a disruptive formatting change
if added later. It is also a prerequisite for any Track 2
audit-logging work, so building it now keeps Track 2 options open at
near-zero cost.
**Scope:**
- Add a `--log-format={human,json}` global CLI option defaulting to
`human`.
- Implement the `json` variant as a formatter that emits
`{"ts": ..., "level": ..., "logger": ..., "message": ..., ...}` per
line with no log-line wrapping.
- Wire it through `_configure_logging()` so every subcommand benefits
identically.
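A sketch of the `json` formatter; the field set mirrors the bullet
above:
```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """One JSON object per log line; no wrapping, jq-friendly. Sketch only."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exc_info"] = self.formatException(record.exc_info)
        return json.dumps(payload)
```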
**Open question:** do we also want log correlation IDs per job?
That's a bigger change (pushing a context var through the
Interfacer's call stack) and probably Track 2 — skip for 0.1.
### Monitor authentication
**Status:** The monitor binds to `127.0.0.1:8765` by default. No
auth, no tokens. `--host 0.0.0.0` works but has a comment warning the
operator to think.
**Why it matters for release:** loopback-only is a reasonable
default, but the monitor is specifically marketed as the thing
collaborators can watch. "Collaborator" implies a browser somewhere
other than the daemon host. The "correct" answer (TLS, real auth) is
too expensive for 0.1; the "wrong but acceptable" answer (no auth, so
anyone who can reach the port sees everything) is what we have now.
There's a middle ground.
**Scope:**
- Add an optional `monitor_token: str | None = None` setting.
- When set, every request to `/` and `/status.json` must carry
`?token=<value>` in the query string or `X-Status-Token` in the
header. No token → 401.
- `neuropose serve` prints a URL including the token on startup, so
operators can copy-paste it. If `monitor_token` is unset, behavior
is unchanged.
- `--host 0.0.0.0` emits a stderr warning if `monitor_token` is unset
— don't block it, just flag it.
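The check itself is a few lines. Sketched framework-agnostically
below, with `hmac.compare_digest` so the token cannot be guessed
byte-by-byte through response timing:
```python
import hmac


def is_authorized(query_token: str | None, header_token: str | None,
                  monitor_token: str | None) -> bool:
    """True if the request may see / and /status.json. Sketch only."""
    if monitor_token is None:
        return True  # auth disabled: current behavior, unchanged
    for supplied in (query_token, header_token):
        if supplied is not None and hmac.compare_digest(supplied, monitor_token):
            return True
    return False  # caller responds 401
```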
**Non-scope:**
- TLS. Use a reverse proxy (Caddy, nginx, `ssh -L`) for any
internet-facing exposure. The monitor is not the right place to
terminate TLS.
- Multi-user auth, session cookies, anything with a database. That's
Track 2.
### Docker GPU image
**Status:** `Dockerfile` exists (CPU-only). `Dockerfile.gpu`
mentioned in CHANGELOG as planned.
**Why it matters for release:** a single-file CUDA deployment story
reduces "can I run this on our lab server?" from a 45-minute dance
with conda and CUDA versions to one `docker run`. For Linux GPU
users this is the friction difference between trying the project and
bouncing.
**Scope:**
- Write `Dockerfile.gpu` on top of `nvidia/cuda:12.x-runtime-ubuntu22.04`
(pick the version TF 2.18 actually supports — check the
`tensorflow-gpu` compat matrix, not just "latest").
- Multi-stage: build stage has `uv` and builds the venv; final stage
just copies the venv and sets entrypoints.
- Add a `docker-build.yml` CI workflow that builds both images on
every push to main and publishes as `ghcr.io/neuwirth/neuropose:cpu`
and `:gpu` (or wherever the project ends up hosted).
- Document in `docs/deployment.md` with a `docker run --gpus all`
example.
**Non-scope:**
- A `tensorflow-metal` Docker image. Mac can't virtualize Metal, so
there's no point.
### Dependency freshness automation
**Status:** No Dependabot, no Renovate. Everything floats until
somebody notices. The recent TF cap tightening (`<2.19`) was caught
manually because a user happened to ask; a scheduled bot would have
flagged it weeks earlier.
**Why it matters for release:** security CVEs on transitive
dependencies land every few weeks. Without automation, they get
discovered by a downstream user trying to install into an audited
environment. With automation, they become a PR you either merge or
explicitly decline.
**Scope:**
- Add `.github/dependabot.yml` with groups: `python-prod`,
`python-dev`, `github-actions`. Weekly schedule. Ignore `tensorflow`
updates until manually cleared (the `tensorflow-metal` constraint
means auto-bumping TF is destructive).
- Alternative: Renovate via `renovate.json`. Renovate has better
grouping and scheduling, Dependabot is simpler and needs no setup
on GitHub. For an open-source Brown-lab project, Dependabot is
enough.
- Add `uv lock --upgrade-package <name>` to the dev playbook in
`docs/development.md` so PR authors know how to re-lock.
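A sketch of the config; the group names follow the bullet above, and
the `pip` ecosystem entry is assumed to cover the
`pyproject.toml`-managed dependencies:
```yaml
# .github/dependabot.yml (sketch)
version: 2
updates:
  - package-ecosystem: pip
    directory: /
    schedule:
      interval: weekly
    groups:
      python-prod:
        dependency-type: production
      python-dev:
        dependency-type: development
    ignore:
      - dependency-name: tensorflow  # bumped manually; metal constraint
  - package-ecosystem: github-actions
    directory: /
    schedule:
      interval: weekly
```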
### Release workflow
**Status:** `[project.scripts]` is wired for `pip install`, but no
tag-triggered publishing pipeline. `.github/workflows/docs.yml`
uploads the built docs as a 14-day artifact, not to Pages.
**Why it matters for release:** "release" without a repeatable
publishing flow is a synonym for "one-off person runs hatch build on
their laptop at 11pm before the paper deadline." That is not a
release.
**Scope:**
- `.github/workflows/release.yml` triggered on version tags
(`v[0-9]+.[0-9]+.[0-9]+`). Steps: check version matches
`__version__`; build with `hatch build`; publish to PyPI via
trusted publisher (no long-lived token); create GitHub release with
changelog excerpt.
- Flip `docs.yml` to deploy the `site/` output to GitHub Pages on
every push to `main` once the repo is public. Pin the Pages URL in
the README and in `site_url` in `mkdocs.yml` (already points at
`levineuwirth.github.io`, but verify).
- Sign tags with GPG; document the key fingerprint in `SECURITY.md`
(which does not yet exist; create it).
- Consider wiring sigstore signing at the same time (see the Track 2
supply-chain section). Free after the initial setup, and it buys
everything Track 2 would want without committing to the rest of
that track.
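A skeleton of the tag-triggered workflow. The trusted-publisher
action is real; the version-check step and the
`src/neuropose/__init__.py` path are illustrative assumptions:
```yaml
# .github/workflows/release.yml (sketch)
name: release
on:
  push:
    tags: ["v[0-9]+.[0-9]+.[0-9]+"]
jobs:
  publish:
    runs-on: ubuntu-latest
    permissions:
      id-token: write  # OIDC token for PyPI trusted publishing
      contents: write  # create the GitHub release
    steps:
      - uses: actions/checkout@v4
      - name: Verify tag matches __version__
        run: |
          grep -q "\"${GITHUB_REF_NAME#v}\"" src/neuropose/__init__.py
      - run: pipx run hatch build
      - uses: pypa/gh-action-pypi-publish@release/v1
```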
**Open question:** do we publish under `neuropose`, `brown-neuropose`,
or something else on PyPI? Whichever name, squat it before the paper
drops; waiting means risking name-squatter abuse.
### Error-path test coverage expansion
**Status:** Happy paths and a handful of input-validation errors
covered. Not covered: disk full mid-write, corrupt video mid-decode,
OOM during inference, fcntl.flock on NFS (no-op on some kernels),
truncated zip archives, permission denied on data_dir subdirectories.
**Why it matters for release:** shipping a tool where "happy path
works" is different from shipping a tool where "when it fails, it
fails predictably." For a clinical research pipeline where a crash
mid-job quarantines valuable recording data, fault tolerance is a
feature.
**Scope:**
- Systematic pass: for each module, write a `test_<module>_failure_modes.py`
enumerating the specific exception classes that can escape and the
corresponding test case that triggers each one. Use `pytest.raises`
with the exact expected exception class.
- Hardest cases use fixtures that monkeypatch system calls
(`os.write` raises OSError(ENOSPC), `cv2.VideoCapture.read` returns
`False, None` partway through, `fcntl.flock` raises OSError(EBADF)).
- Aim: every user-facing error message in the codebase has a test
that proves it's reachable.
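One such test, sketched end-to-end; `save_results` here is a
stand-in for whichever neuropose function owns the write path:
```python
import errno
import json
from pathlib import Path

import pytest


def save_results(path: Path, results: dict) -> None:
    """Stand-in for the real save path under test."""
    path.write_text(json.dumps(results))


def test_save_results_disk_full(monkeypatch, tmp_path):
    def enospc(self, *args, **kwargs):
        raise OSError(errno.ENOSPC, "No space left on device")

    # Simulate a full disk at the write primitive the code under test
    # uses; monkeypatch restores it after the test.
    monkeypatch.setattr(Path, "write_text", enospc)
    with pytest.raises(OSError) as excinfo:
        save_results(tmp_path / "results.json", {"ok": True})
    assert excinfo.value.errno == errno.ENOSPC
```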
**Non-scope:**
- Chaos-engineering frameworks. `monkeypatch` is enough.
- Covering unrecoverable errors like SIGKILL of the daemon mid-frame.
That's the recovery-on-startup test, which already exists.
---
## Track 2: Clinical platform (contingent)
Track 2 is everything beyond the open-source research tool:
multi-tenancy, audit logging, HTTP/API layer, clinician UI,
clinical-system integrations, the works. None of it is sequenced
with Phases 0-2; all of it is gated on specific triggers that don't
exist yet.
### Triggers to activate Track 2
Do not start Track 2 work until at least one of the following is
true:
1. **External demand.** Another clinical group has asked for a
deployment they can run independently. Not a casual "interesting
project" remark, but a specific ask with a specific cohort and a
specific timeline.
2. **Multi-site ambition.** The Shu Lab decides to run NeuroPose
across more than one site within Brown-affiliated clinical
systems, and the single-host assumption stops working.
3. **Funding mandate.** A grant or contract specifies outputs that
the Phase 0-1-2 deliverables cannot meet (e.g. "produce a
HIPAA-compliant deployment," "integrate with the EHR").
4. **Publication traction.** Papers A and C get engagement that
translates into demand for a hosted version. Clinical-methods
papers occasionally do. If enough unsolicited inquiries land,
Track 2 becomes worth the investment.
Before at least one of these triggers: everything below is
background thinking, not planned work. *Do not refactor Phase 0 or
Phase 2 code to make Track 2 easier.* Every such refactor is a bet
on a future that may not arrive.
### Multi-tenancy and identity
**What it would require:**
- A concept of "user" distinct from "OS user." Today `Settings.data_dir`
is one directory per OS user; multi-tenancy means one `data_dir`
serving many logical tenants with enforced isolation.
- Per-tenant namespacing in `in/`, `out/`, `failed/`, and
`status.json`. Cleanest is one subdirectory per tenant with the
same four-directory layout; the daemon's discovery logic becomes a
two-level scan.
- Authentication on the control plane. Passing tenant identity as a
command-line arg is fine for a research prototype; a real
deployment needs OAuth/OIDC or SAML with the institution's IdP
(Brown CAS, Epic Auth, whatever the target site uses).
- Authorization model: at minimum, "tenant A cannot see tenant B's
jobs." For clinical deployments, probably also role-based (clinician
/ PI / admin / auditor).
**Cheapest path forward if a trigger fires:** fork the data-directory
layout into `$data_dir/<tenant_id>/{in,out,failed,status.json}`,
teach the daemon to iterate tenants in its poll loop, add a
`--tenant` flag to the CLI. That's enough for an invitation-only
deployment where tenants are identified by opaque string and issued
out-of-band.
**Expensive path:** anything involving an identity provider. Don't
go there without a real operator committing to the deployment.
### Audit logging and compliance posture
**What it would require:**
- Append-only log of every data access, write, and configuration
change, with actor identity and timestamp. Separate from the
application log (which rotates).
- Logs streamed to a write-once sink (S3 with object-lock,
immutable journal) so a compromised host can't rewrite the
trail.
- Legal review: what exactly does HIPAA require of this tool? What
about institutional IRB? The answer will differ across sites, and
the project cannot prescribe it, but the *capability* to generate
the required logs needs to be built in.
- Retention policy wired to the audit log, not just application
state. Pruning job results is different from pruning audit records.
**Technical prerequisite:** structured logging from Phase 2 (which
is a cheap add and is scheduled anyway). Without JSON-per-line logs,
audit extraction is a grep-and-pray regex problem.
### HTTP/API layer
**What it would require:**
- Today the control plane is "write files to `in/`." For a
non-filesystem-native consumer (a hosted web UI, a batch scheduler,
a Jupyter kernel in a different container), an HTTP API is the
right abstraction.
- FastAPI or Litestar on top of the existing ingest/interfacer/io
modules. The daemon becomes a long-running process that serves
requests *and* processes the input directory; or the daemon stays
headless and the HTTP layer is a separate process talking via the
same filesystem contract.
- OpenAPI schema published as part of the release so client code can
be generated.
**Non-obvious pitfall:** the daemon's fcntl-based single-instance
lock assumes one writer. If the HTTP layer is a separate process, it
needs to go through the same ingest API, not directly into `in/`.
That's an easy discipline to establish if designed in from day one,
a painful refactor later.
**Cheap Phase 0/2 precaution:** keep `neuropose.ingest` and
`neuropose.interfacer` API-stable as Python modules. If a future
HTTP layer imports them, we don't want to break the import.
### Clinician-facing UI
**What it would require:**
- More than the `neuropose serve` dashboard: an actual web
application with clinician-facing views (patient list, session
list, session-level pose visualization, comparison against
reference motion, exportable reports).
- Probably React + TypeScript on the frontend, consuming the HTTP
API above. Backend-rendered templates would be faster to build but
a worse fit for the per-session interaction model clinicians
expect.
- WebGL or Three.js for 3D pose playback. The `neuropose.visualize`
module is a matplotlib-based still-frame tool; rebuilding it for
interactive 3D is a weeks-to-months project on its own.
- Accessibility: clinician environments include keyboard-only users,
users on institutional IE holdovers (yes, still), users with
screen readers. A research-grade UI ignores this; a clinical-grade
one cannot.
**Scope is enormous.** This is the single largest piece of Track 2
and would likely dwarf all other Track 2 work combined. Would not
start without dedicated frontend engineering effort.
### Horizontal scaling
**What it would require:**
- A message broker (Redis Streams, RabbitMQ, or NATS) in place of the
filesystem poll. Each job becomes a broker message; multiple
worker processes consume and process in parallel.
- Shared storage for inputs and outputs (S3, MinIO, NFS). The
"job_name is a directory" contract generalizes to "job_name is an
object prefix."
- Per-worker GPU affinity for the multi-GPU case; worker auto-sizing
based on queue depth.
- Distributed lock for the leader-only work (status file writes,
retention enforcement).
**Upgrade path that minimizes pain:** the current single-process
daemon is equivalent to the "one worker" case of a horizontal
deployment. If the job object in `neuropose.io` stays the source of
truth (not the filesystem layout), the transition is backend-swap,
not architectural surgery. Keep that option open by treating the
filesystem as an implementation detail of `Interfacer`, not a public
contract.
### Backup, replication, and data durability
**What it would require:**
- Outputs (`out/<job>/results.json`) currently live on one disk on
one host. For clinical data this is insufficient durability.
- Replication target: another host (hot standby), object storage
(warm archive), or both. The `out/` directory is the canonical
store; replicating it periodically is a scriptable cron job today.
- Proper replication: as writes happen, not as a cron. Either a
daemon-side hook that PUTs to S3 immediately after each
`save_job_results`, or a sidecar process watching the filesystem
with `inotify`/`fswatch`.
- Restore story: how do we restore `out/` from backup without
breaking `status.json` (which refers to job names by convention)?
Test this annually.
**Minimum viable backup for Phase 2:** add a `scripts/backup.sh`
that rsyncs `$data_dir/out/` to a configurable destination. Not a
feature; a paving-the-path-for-operators artifact.
### Clinical-system integrations
**What it would require:**
- **DICOM** if videos are stored as DICOM instances rather than
MP4. Clinical motion-analysis devices increasingly output DICOM
video; reading DICOM means `pydicom` + some decoding logic.
- **FHIR** for patient metadata. If NeuroPose is to accept a
patient ID and attach it to a session, that ID probably comes
from a FHIR Patient resource. Means speaking FHIR to the hospital's
FHIR endpoint (Epic, Cerner).
- **REDCap** integration for clinical-research cohorts (the Brown
ecosystem uses it heavily). An export script that pulls subject
metadata from a REDCap project and lays it into the ingest
directory is cheap and valuable.
**Order of likely need:** REDCap first (easy, valuable, Brown-local),
then DICOM (depends on what the recording device outputs), then
FHIR (only if we're pulling from an EHR, which we probably aren't
for research).
### Deterministic inference mode
**What it would require:**
- Phase 0's `Provenance` object already captures model SHA, TF
version, NumPy version, and a seed field. The missing piece for
strict reproducibility is forcing TensorFlow itself to behave
deterministically —
`tf.config.experimental.enable_op_determinism()` plus seeding all
of `random`, `numpy.random`, and `tf.random`.
- A `deterministic: bool = False` setting on `Settings` that flips
the above. Default off, because deterministic mode costs a
meaningful fraction of throughput on GPUs and isn't free on CPUs
either. Clinical deployments would turn it on; benchmark runs
would turn it off.
- A `Provenance.deterministic` boolean field is already in the Phase
0 scope; this item closes the loop by actually making that
boolean mean something.
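The switch itself is small. The TensorFlow and NumPy calls below are
real APIs; the `Settings` wiring is the part this item adds:
```python
import random

import numpy as np
import tensorflow as tf


def enable_determinism(seed: int) -> None:
    """Seed every RNG in play and force deterministic TF op kernels."""
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    tf.config.experimental.enable_op_determinism()
```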
**Cheap Phase 2 precaution:** wire the setting in Phase 2 even if we
don't flip it on by default. Future Track 2 deployments can flip it
without a code change.
### Observability and SLOs
**What it would require:**
- Prometheus metrics endpoint (separate port from the monitor, no
auth needed on metrics, loopback or behind a scraper only).
- Counters: jobs_processed, jobs_failed, frames_processed, bytes_read,
bytes_written. Histograms: per-frame latency, per-job latency,
per-video latency. Gauges: queue depth, active job count.
- Tracing: OpenTelemetry instrumentation on job_process,
detect_poses, save_job_results. Again, the interesting spans are
the long ones, so trace-sampling at 100% is usually fine until
throughput matters.
- Defined SLOs: "99% of jobs complete within 10× video duration,"
"95% of monitor requests return in under 100 ms," etc.
SLO definitions go into a `docs/slos.md`; burn-rate alerting is
the operational half.
**Order-of-magnitude** dependency: none of this is useful without
Track 2 demand. A single-user research Mac doesn't have SLOs.
### Supply-chain attestation and signed releases
**What it would require:**
- SBOM generation on every release (CycloneDX or SPDX format,
attached to the GitHub release and published alongside the wheel).
- Signed releases: sigstore / cosign signatures on the wheel, the
Docker images, and the source tarball. GitHub's OIDC +
sigstore makes this a ten-line workflow once. For a clinical tool,
a reviewer being able to verify "this wheel is the one GitHub
Actions produced from this commit" is non-negotiable.
- Reproducible builds: same source → same wheel hash. Python wheels
are usually reproducible with `SOURCE_DATE_EPOCH` set and `.pyc`
exclusion; document the exact command.
- Provenance attestations (SLSA level 2 or 3) for the CI pipeline.
GitHub's `attestations/build-provenance` action does this.
**Cheapest Phase 2 precaution:** wire sigstore signing into the
release workflow when it's first built (see Phase 2 release workflow
section). Free after the initial setup.
### Deployment orchestration
**What it would require:**
- Kubernetes manifests (Helm chart, probably). Pod specs for the
daemon, the monitor, the HTTP API. Separate deployments so they
can scale independently.
- Terraform or Pulumi for the underlying infrastructure: GPU
node pool, object storage, IAM, TLS termination. Site-dependent;
Brown runs primarily on-prem with some AWS — the IaC would need
to target both.
- Secrets management: Vault, AWS Secrets Manager, or K8s
Secrets + External Secrets Operator. The monitor token, the
broker credentials, the object-storage keys all need to stop being
env vars in a `.service` file.
**Strong recommendation:** do not write any of this until there is
a specific deployment with specific operators. Generic K8s manifests
written without a target are a solution in search of a problem, and
they age fast.
---
## Decisions to not prematurely foreclose
A short list of choices we should avoid making in Phase 0 or Phase 2
that would make Track 2 more expensive later:
1. **Keep `neuropose.ingest` and `neuropose.interfacer` API-stable
as Python modules.** A future HTTP layer should be able to import
them. Avoid adding `@staticmethod` decorators that hide internal
state; avoid coupling to global config.
2. **Keep the filesystem layout reversible.** Anything in
`$data_dir` that is not a user artifact should be treated as
internal. If Track 2 wants to replace the filesystem with an
object store, the daemon's only file I/O should be via
`neuropose.io` helpers — no raw opens scattered through the code.
3. **Keep `VideoPredictions.provenance` extensible.** The Phase 0
`Provenance` model should be a pydantic model so fields can be
added backward-compatibly. Don't pack provenance into free-form
strings or nested dicts that require bespoke parsing.
4. **Keep the CLI subcommands orthogonal.** Do not add subcommands
that wrap multiple other subcommands for convenience; that
creates API shape we'd regret if the right composition layer
later is HTTP, not shell.
5. **Keep model loading behind `neuropose._model`.** A future
self-hosted model registry, signed-artifact verification, or
multi-model switching should be a change in one file, not a
refactor across the estimator.
6. **Keep `Settings` the single source of truth.** No `os.environ`
reads outside pydantic-settings; no sprinkled `Path.home()`
calls. Track 2 almost certainly overrides configuration from
a secret store, and if that override has one place to hook in,
it's easy.
7. **Keep status-file schema owned by pydantic, not hand-written
JSON.** Track 2 multi-tenancy means indexing into the status
file by tenant; a pydantic model refactor is cheap, a
hand-written dict refactor is not.
8. **Keep the `AnalysisConfig` shape additive.** The Phase 0 YAML
schema will evolve through Phase 1 as Paper C's experiments
surface needs. Additions are free (new optional fields);
renames and removals invalidate prior experiments. Pydantic's
`extra="forbid"` catches typos at parse time while still
allowing additive extension.
These are cheap-now / expensive-later items. Every other Track 2
decision can wait for a real trigger.