NeuroPose Technical Ideation Notes
A living engineering roadmap, parallel to RESEARCH.md. Where
RESEARCH.md captures open methodological questions (DTW, skeleton
choice, hosting the model), this document captures open engineering
questions — release readiness, operability, scaling — and the paths
they could take.
This is not user-facing documentation. Items here are candidates for future work, and inclusion does not imply commitment.
How to use this document
- Add a section when you start thinking about a new area of technical investment.
- Each section should end with a Scope, Sketch, or Open questions block so it's obvious to a future you (or a new contributor) what the concrete next move would be.
- When an item in here is decided and implemented, move it to the relevant place in `docs/` or in the code itself, and leave a short pointer behind ("See `docs/deployment.md` for the resolved design.").
- The audience is anyone maintaining the codebase — Levi, David, Praneeth, Dr. Shu, and whoever comes after us. Assume competence in Python and systems work; don't assume familiarity with our specific tooling choices.
Three phases, then a contingent track
There are four distinct technical objectives, ordered by timeline and by what each enables next. The sequencing is deliberate: each phase unblocks the next, and doing them in any other order either publishes Paper C on top of a pipeline its own design notes disavow, or delays the open-source release past the window where the accompanying paper is still salient.
- Phase 0 — C-enabling pipeline work. A targeted subset of engineering work that has to land before Paper C can start. The DTW defaults shipped in 0.1 are explicitly a "mechanical port, not a methodological choice" (see `RESEARCH.md` §1); running the clinical validation study on them would mean publishing results from a pipeline the accompanying design notes explicitly criticize. Phase 0 fixes the analyzer's methodological foundations (Procrustes preprocessing, cycle segmentation, joint-angle DTW representation), locks in the reproducibility surface (`Provenance` subobject, YAML-configurable analysis pipeline), and sets up schema migration so data generated during Phase 1 survives the long write-up. Near-term, well-scoped, weeks of work.
- Phase 1 — Paper C: clinical validation study. The planned clinical-methods paper: cycle-aware joint-angle DTW for clinical gait similarity, validated against MoCap ground truth and/or clinician ratings. Gated on MoCap data access via Dr. Shu. This is research work, not engineering work — this document describes the engineering scaffolding around it, not the paper itself. Phase 2 work can happen in the background during this phase as ideal filler for research-burnout cycles. Months; timeline driven by data access and experimental design.
- Phase 2 — Coordinated open-source release + Paper A. The engineering-paper companion (A) describing the tech stack, plus the tagged 0.1 release: PyPI publication, docs deployment, Docker images, CI matrix, supervision artifacts, doctor preflight, all the operational items that make the tool credible to external users. Timed to arrive with or slightly before Paper C's submission, producing a paper-plus-tool bundle that reviewers can actually run. Weeks of work, timing driven by Paper C's submission window.
- Track 2 — Clinical platform (contingent). Everything beyond the open-source research tool — multi-tenancy, audit logging, HTTP/API layer, clinician UI, clinical-system integrations. Not sequenced; activates only if specific triggers fire (external demand, multi-site ambition, funding mandate, publication traction). Most of this is background thinking, not planned work. The value of keeping it in this document is so that Phase 0 and Phase 2 decisions don't accidentally foreclose Track 2 options.
Phases 0 → 1 → 2 form a near-term sequence that culminates in a paper-plus-release bundle. Track 2 sits outside that sequence and does not gate any of it.
Contents
- Phase 0: C-enabling pipeline work
- Phase 1: Clinical validation study (Paper C)
- Phase 2: Coordinated open-source release + Paper A
- Release definition
- Apple Silicon CI matrix
- Mac hardware validation pass
- Retention and pruning
- neuropose doctor preflight
- Process supervision artifacts
- Structured logging option
- Monitor authentication
- Docker GPU image
- Dependency freshness automation
- Release workflow
- Error-path test coverage expansion
- Track 2: Clinical platform (contingent)
- Triggers to activate Track 2
- Multi-tenancy and identity
- Audit logging and compliance posture
- HTTP/API layer
- Clinician-facing UI
- Horizontal scaling
- Backup, replication, and data durability
- Clinical-system integrations
- Deterministic inference mode
- Observability and SLOs
- Supply-chain attestation and signed releases
- Deployment orchestration
- Decisions to not prematurely foreclose
Phase 0: C-enabling pipeline work
The six items below are prerequisites for Paper C. Until they are
landed, every analysis C would produce would be running on defaults
that RESEARCH.md §1 explicitly flags as provisional. Ship these
first, in any order that suits the implementer's cadence, and the
rest of the project can pick up with confidence that Phase 1 results
are trustworthy.
Procrustes preprocessing
Status: Not implemented. neuropose.analyzer.features ships
extract_joint_angles and feature-statistics helpers; no alignment
step exists between pose sequences.
Why it matters for Paper C: without alignment, DTW distance is translation- and orientation-dependent. Two recordings of the same subject from different camera positions produce different distances, which is almost never what a clinician wants. Paper C's methods section would need to apologize for this in print; cheaper to fix the method than to defend it.
Scope:
- Add `procrustes_align(a: np.ndarray, b: np.ndarray, *, mode: Literal["per_frame", "per_sequence"]) -> tuple[np.ndarray, np.ndarray, AlignmentDiagnostics]` to `neuropose.analyzer.features`. Implements the Kabsch algorithm (closed-form optimal rigid transform). Per-frame aligns each frame of A to the corresponding frame of B independently; per-sequence computes one transform over the whole sequence. Both are useful — per-frame for fine-grained matching, per-sequence for preserving within-trial dynamics.
- Return aligned arrays plus an `AlignmentDiagnostics` dataclass with the fitted rotation magnitude and translation magnitude so downstream code can flag suspiciously large transforms (usually a sign of upstream annotation error).
- Expose as an opt-in `align: Literal["none", "procrustes_per_frame", "procrustes_per_sequence"] = "none"` parameter on every DTW entry point in `neuropose.analyzer.dtw`. Default `none` preserves current behavior; Paper C's pipeline sets it to `procrustes_per_sequence`.
- Unit tests: construct a known rotation + translation between two synthetic skeletons, verify alignment recovers it to within floating-point precision; verify alignment of a sequence with its own translated copy produces zero residual.
Non-scope:
- Non-rigid alignment (thin-plate splines, learned registration). Not needed for skeleton-level comparison and would be a research contribution on its own.
Open question: should alignment also include optional scaling
(scaled-Procrustes / full Procrustes)? For cross-subject comparison
it almost certainly should. Default to scale-preserving and add a
scale: bool = False flag; Paper C can flip it on for cross-subject
figures.
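The core of either alignment mode is the Kabsch solve. A minimal sketch, assuming (N, 3) arrays of corresponding joints; `kabsch_align` is an illustrative helper, not the proposed `procrustes_align` signature:

```python
import numpy as np

def kabsch_align(a: np.ndarray, b: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Rigidly align point cloud `a` (N, 3) onto `b` (N, 3); returns (aligned_a, R)."""
    a_c = a - a.mean(axis=0)                 # center both clouds
    b_c = b - b.mean(axis=0)
    h = a_c.T @ b_c                          # 3x3 cross-covariance
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))   # guard against reflections
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T  # optimal rotation (Kabsch)
    return a_c @ r.T + b.mean(axis=0), r
```

The `scale: bool` flag from the open question would amount to dividing out the Frobenius norms of the centered clouds before this solve.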
Gait cycle segmentation
Status: segment_by_peaks in neuropose.analyzer.segment
performs generic valley-to-valley segmentation on a supplied 1D
signal. There is no gait-specific wrapper that knows to look at the
heel's vertical coordinate.
Why it matters for Paper C: clinical gait analysis wants to compare the 4th heel-strike of trial A to the 4th heel-strike of trial B, not frame 120 of A vs frame 120 of B. Per-cycle DTW is the standard approach in the biomechanics literature (Sadeghi et al. 2000 and descendants); running full-trial DTW on gait is a choice reviewers of Paper C would correctly flag as methodologically weak.
Scope:
- New `segment_gait_cycles(predictions: VideoPredictions, *, joint: str = "rhee", axis: Literal["x", "y", "z"] = "y", min_cycle_seconds: float = 0.4) -> Segmentation` in `neuropose.analyzer.segment`.
- Under the hood: extract the specified joint's coordinate along the specified axis, apply `segment_by_peaks` with appropriate distance and prominence thresholds (derived from `min_cycle_seconds` via `predictions.metadata.fps`), return the resulting `Segmentation` (the existing `neuropose.io.Segmentation` type) so downstream tooling picks it up unchanged.
- Two-sided detection: run the same detection on the opposite heel and return both per-side segmentations under named keys (`left_heel_strikes`, `right_heel_strikes`). Clinical users will want both.
- Allow the reference joint and axis to be configurable so trials recorded with a different camera orientation (lateral vs frontal vs oblique) can still be segmented without a code change.
Non-scope:
- HMM-based cycle detection, learned cycle detectors. Peak detection on vertical coordinate is standard, well-understood, and the method the biomechanics literature expects to see.
- Handling pathological gaits where heel-strikes are absent (shuffling, walker-assisted). The function should degrade gracefully (return a `Segmentation` with an empty list, not raise), and Paper C's data-quality filtering handles the rest.
Open question: should the function also emit a "confidence" per cycle (prominence of the detected peak, regularity of spacing) that Paper C can use to filter out low-quality detections? Cheap to add, useful downstream. Recommend yes.
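A sketch of the detection core, assuming SciPy's `find_peaks`; `detect_heel_strikes` and the 0.1-of-range prominence threshold are illustrative choices, not decided defaults:

```python
import numpy as np
from scipy.signal import find_peaks

def detect_heel_strikes(heel_y: np.ndarray, fps: float,
                        min_cycle_seconds: float = 0.4) -> np.ndarray:
    """Frame indices of candidate heel-strikes (valleys of vertical heel position)."""
    distance = max(1, int(min_cycle_seconds * fps))  # no two strikes closer than this
    span = float(np.ptp(heel_y))
    if span == 0.0:                                  # flat signal: degrade gracefully
        return np.array([], dtype=int)
    # A heel-strike is roughly a valley of the vertical coordinate, so find peaks
    # of the negated signal; prominence as a fraction of range suppresses jitter.
    idx, _ = find_peaks(-heel_y, distance=distance, prominence=0.1 * span)
    return idx
```

The per-peak prominences that `find_peaks` can also return are exactly the raw material for the per-cycle confidence score the open question proposes.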
Joint-angle DTW representation
Status: dtw_all, dtw_per_joint, and dtw_relation operate on
raw 3D coordinates or joint-pair displacements. extract_joint_angles
produces per-frame angle sequences but is not wired as a DTW input.
Why it matters for Paper C: angle-space DTW is translation- and rotation-invariant by construction, scale-invariant if normalized, and directly interpretable in clinical terms ("knee flexion angle during swing phase"). Paper C's headline figures almost certainly use angle-space distances; raw coordinates would draw the obvious reviewer question of why we aren't comparing the thing clinicians actually measure.
Scope:
- Add `representation: Literal["coords", "angles", "relation"] = "coords"` to every DTW entry point. The `coords` default preserves existing behavior; `angles` runs `extract_joint_angles` on each input first; `relation` is the existing `dtw_relation` path expressed as a representation choice rather than a separate function (leaving the `dtw_relation` name as a convenience wrapper if preferred).
- Degenerate-vector handling: `extract_joint_angles` returns NaN for degenerate (zero-length) vectors. The DTW path needs to decide how to handle NaN — skip-and-interpolate, drop, or propagate to the distance. Propagation is safest (makes the problem visible); interpolation is what clinical users probably want day-to-day. Default to propagation and expose `nan_policy: Literal["propagate", "interpolate", "drop"]` for experimentation.
- Tests: synthetic pair with known angular difference, assert DTW in angle-space recovers it independent of global rotation applied to the input.
Non-scope:
- Quaternion or SO(3) rotation-space DTW. Interesting but requires a rotation parameterization the current skeleton output does not produce.
- Mixed-representation (position + angle concatenated, learned embeddings). These are experiments Paper C might run; they don't belong in Phase 0 infrastructure.
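To make the invariance claim concrete, here is a hedged stand-in for a per-joint angle computation (the real `extract_joint_angles` may differ in detail): the angle at a joint is unchanged by any global rotation or translation of the skeleton, which is exactly why angle-space DTW needs no Procrustes step.

```python
import numpy as np

def joint_angle(p_prox, p_joint, p_dist) -> float:
    """Angle (radians) at p_joint between the two adjoining limb segments."""
    u = np.asarray(p_prox, dtype=float) - np.asarray(p_joint, dtype=float)
    v = np.asarray(p_dist, dtype=float) - np.asarray(p_joint, dtype=float)
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0.0 or nv == 0.0:
        return float("nan")  # degenerate segment, mirroring extract_joint_angles
    cos = np.dot(u, v) / (nu * nv)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))
```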
Provenance subobject
Status: PerformanceMetrics captures tensorflow_version,
active_device, and tensorflow_metal_active. Model SHA is not
computed or propagated. numpy_version and neuropose_version are
not recorded. No first-class Provenance object.
Why it matters for Paper C: reproducibility is the first
question a reviewer asks of a clinical-methods paper. The answer
needs to be "same model artifact, same pipeline config, same
versions, same seeds" — and all four need to be recorded on every
results.json that underlies a paper figure. Not having this means
either manually tracking it in a lab notebook (fragile, won't
survive personnel turnover) or running every experiment through a
pinned Docker image (expensive, doesn't capture runtime
non-determinism). The subobject is the cheap right answer.
Scope:
- New `Provenance` pydantic model in `neuropose.io` with fields: `model_sha256: str`, `model_filename: str`, `tensorflow_version: str`, `tensorflow_metal_version: str | None`, `numpy_version: str`, `neuropose_version: str`, `python_version: str`, `seed: int | None`, `deterministic: bool`, `analysis_config: dict | None` (the YAML of the run if the pipeline was invoked via `neuropose analyze --config`).
- Optional `provenance: Provenance | None = None` field on `VideoPredictions`, `JobResults`, and `BenchmarkResult`. None-valued on legacy files (enabled by schema migration — see below), populated on every new write.
- `_model.py` hashes the downloaded tarball on first load (after the existing SHA verification — the two checks use the same hash so compute is amortized) and exposes the hash via a `get_model_sha256()` method on the `Estimator`.
- `Interfacer._run_job_inner` constructs the `Provenance` and attaches it to the output.
- Unit test: serialize → JSON → deserialize round-trip identity; assert `model_sha256` matches the SHA recorded in `neuropose._model`.
Non-scope:
- Cryptographic signatures on results.json. That's Phase 2 (sigstore on release artifacts) or Track 2 (per-output signing) territory, not Phase 0.
- Provenance on arbitrary intermediate products (numpy arrays, DTW distance matrices). Top-level JSONs cover Paper C's needs; richer intermediates can inherit from a hand-off if needed.
Open question: does Paper C need per-frame provenance (which frame was processed with which configuration) or just per-job provenance? Per-job is enough for reproducibility; per-frame is only useful if we want to mix configurations within a single job, which has no current use case.
YAML-configurable analysis pipeline
Status: neuropose.cli's analyze subcommand is a stub that
raises NotImplementedError. Analysis operations are called
individually from Python, or via CLI flags on segment and
benchmark. No unified representation of "a complete analysis run."
Why it matters for Paper C: the paper will run many experimental
configurations — alignment on/off, per-frame vs per-sequence, raw
coordinates vs joint angles, full-trial vs cycle-segmented DTW,
various distance metrics. Each experiment should be reproducible
from a single file that can be version-controlled, diffed, attached
to the Provenance object, and cited in the paper. A Python script
full of kwargs is the alternative, and it's exactly the alternative
the open-source community collectively decided against ten years ago.
This item also resolves the "neuropose analyze: ship or remove"
question that was previously open: we are shipping analyze, just
specifically in a YAML-driven form. The stub that currently exists
becomes the real command in Phase 0.
Scope:
- `AnalysisConfig` pydantic model in `neuropose.analyzer` capturing the full pipeline: input source (predictions file path), preprocessing (`align`, `normalize`, `segment`), per-segment analysis (DTW backend, representation, distance function, extra kwargs), output (figures, statistics, distance matrices).
- Parseable from YAML via pydantic; validated on parse so typos in field names fail early with a clear error.
- `neuropose analyze --config experiment.yaml [--output results_042.json]` runs the pipeline end-to-end. The config YAML is serialized into the resulting `Provenance.analysis_config`, so the output file is self-describing.
- Ship three or four example configs under `examples/analysis/` that exercise the full matrix of alignment × representation × segmentation choices Paper C will care about. Double as integration tests.
Non-scope:
- A DAG / workflow engine (Snakemake, Nextflow). A flat config is enough for Paper C's needs; reach for a DAG tool only when experiments have genuine inter-stage dependencies, which analysis of a single video does not.
- Parallel sweep execution. Run multiple configs via a shell loop for now (`for cfg in examples/analysis/*.yaml; do neuropose analyze --config "$cfg" --output "out/$(basename "$cfg" .yaml).json"; done`). A real sweep orchestrator is Track 2.
Open question: should there be a neuropose analyze compare <config_a.yaml> <config_b.yaml> subcommand that runs both and
emits a diff figure? Useful for Paper C but not a gating feature —
post-Phase-0 addition if the need is clear.
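For concreteness, one hypothetical `experiment.yaml` shape — every field name here is illustrative until `AnalysisConfig` is actually defined:

```yaml
# Illustrative AnalysisConfig sketch; field names are assumptions, not schema.
input:
  predictions: out/trial_042.json
preprocessing:
  align: procrustes_per_sequence
  normalize: true
  segment:
    method: gait_cycles
    joint: rhee
    axis: y
analysis:
  representation: angles
  nan_policy: propagate
  distance: euclidean
output:
  figures: true
  distance_matrix: out/trial_042_dtw.npy
```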
Schema migration for VideoPredictions
Status: VideoPredictions gained segmentations: dict[str, Segmentation] = Field(default_factory=dict) during recent work. Old
JSON files without the field still load (pydantic default-factories
back-fill), but this is accidental rather than designed-in.
Why it matters for Paper C: Paper C will produce analysis results
over the course of 6-12 months. During that window, Phase 0 work
itself will evolve — the Provenance object will gain fields, the
AnalysisConfig shape will stabilize, maybe the Segmentation schema
will extend. Without migration support, every schema change would
invalidate some portion of Paper C's already-generated data, forcing
either a freeze (drops velocity) or a full re-run (wastes compute).
Migration now is the cheap fix.
Scope:
- Add a `schema_version: int = 1` field to `VideoPredictions`, `JobResults`, and `BenchmarkResult` (the three load-anywhere top-level schemas).
- Write `migrate_video_predictions(payload: dict) -> dict` that takes a raw JSON-loaded dict, dispatches on `schema_version`, and returns a dict conformant with the current version. Default to 1 when missing (existing files).
- Wire it into `load_video_predictions()` so the migration runs before pydantic validation. Log at INFO on migration so users see when files are being upgraded.
- When writing, always write the current version.
Non-scope:
- A general-purpose migration framework. A function that dispatches on an integer is sufficient until we have three versions.
- In-place migration (writing back the upgraded file). Migrations should run on read; write-back is a separate operator decision.
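The dispatch-on-an-integer shape could look like this; the v1 → v2 example step (back-filling `provenance`) is an assumption about how the Provenance work above will land:

```python
import logging

logger = logging.getLogger(__name__)
CURRENT_SCHEMA_VERSION = 2  # hypothetical: assume v2 added the provenance field

def migrate_video_predictions(payload: dict) -> dict:
    """Upgrade a raw JSON-loaded dict to the current schema, one step at a time."""
    version = payload.get("schema_version", 1)  # pre-versioning files are v1
    if version > CURRENT_SCHEMA_VERSION:
        raise ValueError(f"file schema v{version} is newer than this neuropose build")
    if version < 2:
        payload.setdefault("provenance", None)  # v1 -> v2: legacy files get None
        logger.info("migrated payload from schema v%d to v2", version)
    payload["schema_version"] = CURRENT_SCHEMA_VERSION
    return payload
```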
Phase 1: Clinical validation study (Paper C)
Phase 1 is Paper C itself — the clinical-methods paper this project
exists to produce. The content belongs in the paper, in RESEARCH.md,
and in the analysis-config YAMLs under examples/, not here. This
section exists only to demarcate the phase and to capture the
engineering commitments that should (and should not) happen during it.
Engineering posture during Phase 1:
- Phase 0 is frozen on entry. Don't refactor the analyzer during Phase 1; refactors invalidate earlier experiments. If a Phase 0 shortcoming surfaces during paper-writing, log it in `RESEARCH.md` and revisit after submission.
- Phase 2 work is welcome as background. Writing a launchd plist, wiring up Dependabot, tightening error-path tests — all of this is ideal filler work during the experimental-design and writing phases of Paper C. It consumes different energy than research work does, and the tool is in better shape on submission day as a result.
- `RESEARCH.md` gets the bulk of the updates. Methods decisions, reading-list expansions, reviewer-response notes all live there, not here.
- Do add engineering-side notes here when a Paper C experiment reveals a piece of missing tooling that's worth a Phase 2 item (for example: "we needed batch-analysis across 200 trials and hit this, so Phase 2 should include ..."). Phase 1 is the best possible source of prioritization signal for what Phase 2 is actually worth.
Prerequisite outside this document: a MoCap-data-access
conversation with Dr. Shu. Nothing in Phase 1 can start until that
conversation has resolved. RESEARCH.md §3 flags this as the
gating question for fine-tuning; it is equally the gating question
for validation.
Phase 2: Coordinated open-source release + Paper A
Phase 2 is the release. Its content is exactly the items listed here — the engineering work to take the Phase-0-plus-Phase-1 codebase to a state where an outside researcher can pick it up, install it, run it, verify its claims, and cite it. It runs concurrently with the tail end of Phase 1 (see posture notes above) and culminates in a coordinated drop: tag → PyPI → Pages → arXiv / JOSS submission for Paper A → reference in Paper C's Code Availability section.
Release definition
Before enumerating the remaining work, define what "released" means. A release candidate should satisfy all of the following:
- Installable on a blank machine. `pip install neuropose` or `uv pip install neuropose` works on both Linux x86_64 and Apple Silicon Mac, with no manual steps beyond Python 3.11.
- Runnable without the author in the room. The `docs/` site is published somewhere persistent (GitHub Pages, Cloudflare Pages), the getting-started walkthrough actually works end-to-end, and the MeTRAbs model downloads and verifies on first run.
- Verifiable by a reviewer. CI runs on every push, covers both Linux and macOS, and a PR from a stranger could be meaningfully reviewed without access to the research Mac.
- Honest about its limits. Every surface the release advertises is either exercised in CI or clearly marked experimental. No false promises in the README or CLI help text. (The `analyze` stub that motivated this item pre-Phase-0 is now real per Phase 0's YAML pipeline, so "ship or remove" is no longer open.)
- Versioned. A git tag exists, `__version__` matches, and `CHANGELOG.md` has a real release section, not just `[Unreleased]`.
- Bundled. Paper A (tech-stack writeup) and Paper C (clinical validation) cite the release tag, and the release notes cite them. The three artifacts arrive together; reviewers of either paper can find and run the code.
Items below are the gaps between the end-of-Phase-0 state and that definition.
Apple Silicon CI matrix
Status: RESEARCH.md lists this as an open next step; no
macos-14 entry in .github/workflows/ci.yml.
Why it matters for release: every claim of "Apple Silicon
support" is currently "by construction" — the TF 2.16+ floor ships
darwin/arm64 wheels, the MeTRAbs SavedModel has zero custom ops, and
therefore it should work. It has not been empirically confirmed on
real hardware in an automated way. For a public release, we either
verify in CI or we stop claiming Mac support in the README.
Scope:
- Add a `macos-14` matrix entry to the `test` job (lint and typecheck stay single-platform; they're platform-independent).
- Exclude `slow` markers on macOS so we don't pay the 2 GB model download twice per run.
- Accept that the first green macOS run may require two or three hotfixes — path case sensitivity, `multiprocessing` spawn vs fork, shared library load order — and budget a day for that.
- Do not add a Metal runner. GitHub's `macos-14` runners don't expose the GPU to TensorFlow in a useful way, and the `[metal]` extra's numerical verification is a separate task that needs real M-series silicon we control.
Sketch:
```yaml
test:
  strategy:
    fail-fast: false
    matrix:
      os: [ubuntu-latest, macos-14]
  runs-on: ${{ matrix.os }}
```
Everything else in the job stays the same; uv works identically on
both platforms.
Mac hardware validation pass
Status: Unexercised. The Shu Lab research Mac (100.64.15.110) is
available; we have an rsync script but no cron job, no automated
smoke check, no numerical-divergence report against the Linux
baseline.
Why it matters for release: CI on GitHub's macos-14 runners
validates that the wheels install and the tests pass on Apple
Silicon. It does not validate that the real MeTRAbs model loads, that
inference runs, or that poses3d on the Mac matches poses3d on
Linux within a sane tolerance. Those are different questions, and
answering them against a throwaway runner each time would be wasteful
and unreliable.
A minimum version of this check — "does detect_poses produce
output on the research Mac at all?" — should happen during Phase 0
regardless, because Paper C will likely run on the same hardware and
a silent numerical divergence there would invalidate the paper's
results. The scope below is the full, release-grade version.
Scope:
- Run `neuropose benchmark --compare-cpu` against a reference clip on the research Mac. Capture the resulting `BenchmarkResult` JSON.
- Commit the JSON as `benchmarks/reference/mac_m3_ultra_cpu_v0_1.json` (a tracked file, not gitignored — this is the numerical reference we'll compare against going forward).
- Separately, run the `[metal]` path and diff. Record in `RESEARCH.md` whether divergence is within the ~1e-2 mm budget the research notes propose, or whether the Metal path is in the "use at your own risk" column.
- Document the findings as a new section in `RESEARCH.md` ("Apple Silicon verification, 2026-0X") and close the corresponding open-question entry.
Open question: should the reference JSON become a test input (slow-marked integration test that re-runs benchmark on a developer's machine and asserts divergence from the committed reference), or just documentation? The former catches regressions automatically at the cost of a 2 GB model download in the slow job; the latter is cheaper but easier to ignore.
Retention and pruning
Status: out/ and failed/ grow forever. No retention config.
No neuropose prune command.
Why it matters for release: a research Mac running the daemon unattended for months will fill its disk. The first support request will be "the daemon just stopped working" and the answer will be "you ran out of disk." We can solve this once now, or a hundred times later.
Scope:
- Add a `retention_days: int | None = None` setting (default None = disabled, preserving current behavior).
- When set, the daemon checks on each poll whether any job in `out/` or `failed/` is older than the threshold and removes it. The corresponding `status.json` entry transitions to a new `PRUNED` state (keeping the audit trail) or is removed entirely (keeping the status file small) — pick one and document.
- Ship a `neuropose prune [--older-than N] [--dry-run]` one-shot command for operators who want manual control.
- Document in `docs/deployment.md` with a recommended default (30 days feels right for benchmark/iteration workflows; clinical deployments would be legal-driven and much longer).
Open question: should pruned jobs' status.json entries be
preserved as tombstones (so a user asking "where did job X go?" can
see "pruned 2026-05-01") or removed entirely? Tombstones are more
user-friendly; removal keeps the status file bounded. Default to
tombstones since the status file bound is only a problem at a scale
the 0.1 release won't hit.
neuropose doctor preflight
Status: Not implemented.
Why it matters for release: pydantic-settings validates the
schema of Settings (is device a valid string, is
poll_interval_seconds positive). It does not validate the
environment — is data_dir writable, is the lock file acquirable,
is model_cache_dir on the same filesystem as data_dir (so
os.rename works atomically), is the configured TF device actually
available. Each of those is a runtime failure mode that shows up with
an ugly traceback ten seconds after neuropose watch starts, and
every one is cheaply detectable at startup.
Scope:
- New subcommand `neuropose doctor` that runs a battery of preflight checks and prints a pass/fail table.
- Checks to include: `data_dir` exists and is writable; lock file acquirable (with clean release); all three subdirectories (in/out/failed) writable; `model_cache_dir` writable and on the same filesystem as `data_dir`; TF is importable; configured `device` is in `tf.config.list_physical_devices()`; `tensorflow-metal` either absent or installed with a version that advertises support for the installed TF; XDG envvars are sane; Python version matches the `pyproject.toml` floor.
- Exit code 0 if all checks pass, 1 if any warning, 2 if any fatal failure.
- The daemon's `run()` entry point calls the same underlying preflight function before entering the poll loop, so `watch`-without-`doctor` still gets the benefit.
Non-scope:
- Do not check for network access to the MeTRAbs download host. Network-dependent checks make CI flaky and don't match the offline caching behavior of real operators.
Process supervision artifacts
Status: docs/deployment.md documents a systemd user unit as
text in prose. No file in scripts/ that a user can actually copy.
No macOS launchd plist at all.
Why it matters for release: copy-paste from a docs page into a
.service file works, but it's friction. An open-source project with
"here is the file, here is where it goes, here is the enable command"
ships deployments faster.
Scope:
- Ship `scripts/systemd/neuropose.service` as a file with `%h` placeholders and a short install README.
- Ship `scripts/launchd/org.levineuwirth.neuropose.plist` as a file with an install README. (Consider making the plist label match the reverse-DNS of whoever is hosting — either the lab's or `org.neuropose.daemon` for a vendor-neutral identity.)
- Optional: a `scripts/install_service.sh` that detects the platform and runs the right install command. Probably not worth the complexity; a five-line README section per platform is fine.
Non-scope:
- Do not write installers for init systems we do not personally run (upstart, sysvinit, runit). If someone needs those, the systemd unit gives them enough of a template.
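For reference, a minimal unit of the kind the scope describes; the `ExecStart` path is an assumption about a pipx/uv-style per-user install and would need to match the operator's layout:

```ini
# scripts/systemd/neuropose.service — illustrative sketch, not the shipped file.
[Unit]
Description=NeuroPose watch daemon
After=network.target

[Service]
ExecStart=%h/.local/bin/neuropose watch
Restart=on-failure
RestartSec=5

[Install]
WantedBy=default.target
```

Installed as a user unit: copy to `~/.config/systemd/user/` and `systemctl --user enable --now neuropose`.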
Structured logging option
Status: Everything logs to stderr via logging.basicConfig
with a human-readable formatter.
Why it matters for release: the current format is correct for
interactive use. For any consumer that wants to feed the daemon's
output into Loki, Splunk, Grafana, Datadog, or even jq-based
aggregation, JSON-per-line would eliminate a parsing step. This is
a near-free feature if added now and a disruptive formatting change
if added later. It is also a prerequisite for any Track 2
audit-logging work, so building it now keeps Track 2 options open at
near-zero cost.
Scope:
- Add a `--log-format={human,json}` global CLI option defaulting to `human`.
- Implement the `json` variant as a formatter that emits `{"ts": ..., "level": ..., "logger": ..., "message": ..., ...}` per line with no log-line wrapping.
- Wire it through `_configure_logging()` so every subcommand benefits identically.
Open question: do we also want log correlation IDs per job? That's a bigger change (pushing a context var through the Interfacer's call stack) and probably Track 2 — skip for 0.1.
Monitor authentication
Status: The monitor binds to 127.0.0.1:8765 by default. No
auth, no tokens. --host 0.0.0.0 works but has a comment warning the
operator to think.
Why it matters for release: loopback-only is a reasonable default, but the monitor is specifically marketed as the thing collaborators can watch. "Collaborator" implies a browser somewhere other than the daemon host. The "correct" answer (TLS, real auth) is too expensive for 0.1; the "wrong but acceptable" answer (no auth, so anyone who can reach the port sees everything) is what we have now. There's a middle ground.
Scope:
- Add an optional `monitor_token: str | None = None` setting.
- When set, every request to `/` and `/status.json` must carry `?token=<value>` in the query string or `X-Status-Token` in the header. No token → 401.
- `neuropose serve` prints a URL including the token on startup, so operators can copy-paste it. If `monitor_token` is unset, behavior is unchanged.
- `--host 0.0.0.0` emits a stderr warning if `monitor_token` is unset — don't block it, just flag it.
Non-scope:
- TLS. Use a reverse proxy (Caddy, nginx, `ssh -L`) for any internet-facing exposure. The monitor is not the right place to terminate TLS.
- Multi-user auth, session cookies, anything with a database. That's Track 2.
Docker GPU image
Status: Dockerfile exists (CPU-only). Dockerfile.gpu
mentioned in CHANGELOG as planned.
Why it matters for release: a single-file CUDA deployment story
reduces "can I run this on our lab server?" from a 45-minute dance
with conda and CUDA versions to one docker run. For Linux GPU
users this is the friction difference between trying the project and
bouncing.
Scope:
- Write `Dockerfile.gpu` on top of `nvidia/cuda:12.x-runtime-ubuntu22.04` (pick the version TF 2.18 actually supports — check the `tensorflow-gpu` compat matrix, not just "latest").
- Multi-stage: the build stage has `uv` and builds the venv; the final stage just copies the venv and sets entrypoints.
- Add a `docker-build.yml` CI workflow that builds both images on every push to main and publishes them as `ghcr.io/neuwirth/neuropose:cpu` and `:gpu` (or wherever the project ends up hosted).
- Document in `docs/deployment.md` with a `docker run --gpus all` example.
Non-scope:
- A `tensorflow-metal` Docker image. Docker on macOS cannot pass Metal through to containers, so there's no point.
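A rough shape for the multi-stage build, hedged: the CUDA base tag, Python packages, and the `uv` copy source are assumptions to verify against the TF compat matrix before committing anything.

```dockerfile
# --- build stage: uv resolves and installs the locked environment ---
FROM nvidia/cuda:12.3.2-runtime-ubuntu22.04 AS build
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 python3-venv \
    && rm -rf /var/lib/apt/lists/*
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
WORKDIR /app
COPY . .
RUN uv sync --frozen --no-dev

# --- final stage: just the venv and the runtime, no build tooling ---
FROM nvidia/cuda:12.3.2-runtime-ubuntu22.04
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 \
    && rm -rf /var/lib/apt/lists/*
COPY --from=build /app/.venv /app/.venv
ENV PATH="/app/.venv/bin:${PATH}"
ENTRYPOINT ["neuropose"]
```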
Dependency freshness automation
Status: No Dependabot, no Renovate. Everything floats until somebody notices. The recent TF cap tightening (`<2.19`) was caught manually because a user happened to ask; a scheduled bot would have flagged it weeks earlier.
Why it matters for release: security CVEs on transitive dependencies land every few weeks. Without automation, they get discovered by a downstream user trying to install into an audited environment. With automation, they become a PR you either merge or explicitly decline.
Scope:
- Add `.github/dependabot.yml` with groups: `python-prod`, `python-dev`, `github-actions`. Weekly schedule. Ignore `tensorflow` updates until manually cleared (the `tensorflow-metal` constraint means auto-bumping TF is destructive).
- Alternative: Renovate via `renovate.json`. Renovate has better grouping and scheduling; Dependabot is simpler and needs no setup on GitHub. For an open-source Brown-lab project, Dependabot is enough.
- Add `uv lock --upgrade-package <name>` to the dev playbook in `docs/development.md` so PR authors know how to re-lock.
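A starting point for the Dependabot config, assuming the default repo layout (the group names follow the list above; splitting by `dependency-type` is one way to get the prod/dev separation):

```yaml
# .github/dependabot.yml (sketch)
version: 2
updates:
  - package-ecosystem: "pip"
    directory: "/"
    schedule:
      interval: "weekly"
    groups:
      python-prod:
        dependency-type: "production"
      python-dev:
        dependency-type: "development"
    ignore:
      # tensorflow-metal constraint: TF bumps are cleared manually
      - dependency-name: "tensorflow"
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
```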
Release workflow
Status: `[project.scripts]` is wired for `pip install`, but there is no tag-triggered publishing pipeline. `.github/workflows/docs.yml` uploads the built docs as a 14-day artifact, not to Pages.
Why it matters for release: "release" without a repeatable publishing flow is a synonym for "one person runs `hatch build` on their laptop at 11pm before the paper deadline." That is not a release.
Scope:
- `.github/workflows/release.yml` triggered on version tags (`v[0-9]+.[0-9]+.[0-9]+`). Steps: check the tag matches `__version__`; build with `hatch build`; publish to PyPI via trusted publishing (no long-lived token); create a GitHub release with a changelog excerpt.
- Flip `docs.yml` to deploy the `site/` output to GitHub Pages on every push to `main` once the repo is public. Pin the Pages URL in the README and in `site_url` in `mkdocs.yml` (already points at `levineuwirth.github.io`, but verify).
- Sign tags with GPG; document the key fingerprint in `SECURITY.md` (which does not yet exist; create it).
- Consider wiring sigstore signing at the same time — see the Track 2 supply-chain section. Free after the initial setup, and buys everything Track 2 would want without committing to the rest of that track.
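The workflow skeleton, assuming PyPI trusted publishing has been configured for this repo (action versions, job name, and the release-notes step are illustrative, and the version-vs-`__version__` check is elided):

```yaml
# .github/workflows/release.yml (sketch)
name: release
on:
  push:
    tags: ["v[0-9]+.[0-9]+.[0-9]+"]
jobs:
  publish:
    runs-on: ubuntu-latest
    permissions:
      id-token: write    # OIDC token for PyPI trusted publishing
      contents: write    # needed to create the GitHub release
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install hatch
      - run: hatch build   # the tag-vs-__version__ check would run before this
      - uses: pypa/gh-action-pypi-publish@release/v1  # no long-lived token
      - uses: softprops/action-gh-release@v2
        with:
          generate_release_notes: true
```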
Open question: do we publish under `neuropose`, `brown-neuropose`, or something else on PyPI? Whichever name, claim it before the paper drops — waiting means risking namesquatter abuse.
Error-path test coverage expansion
Status: Happy paths and a handful of input-validation errors are covered. Not covered: disk full mid-write, corrupt video mid-decode, OOM during inference, `fcntl.flock` on NFS (a no-op on some kernels), truncated zip archives, permission denied on `data_dir` subdirectories.
Why it matters for release: shipping a tool where "happy path works" is different from shipping a tool where "when it fails, it fails predictably." For a clinical research pipeline where a crash mid-job quarantines valuable recording data, fault tolerance is a feature.
Scope:
- Systematic pass: for each module, write a `test_<module>_failure_modes.py` enumerating the specific exception classes that can escape and the corresponding test case that triggers each one. Use `pytest.raises` with the exact expected exception class.
- The hardest cases use fixtures that monkeypatch system calls (`os.write` raises `OSError(ENOSPC)`, `cv2.VideoCapture.read` returns `(False, None)` partway through, `fcntl.flock` raises `OSError(EBADF)`).
- Aim: every user-facing error message in the codebase has a test that proves it's reachable.
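The pattern in miniature — a fake file object standing in for a full disk, with `write_results` as a placeholder for the real `neuropose.io` helper (both names are illustrative, not existing code):

```python
import builtins
import errno
import json

import pytest

def write_results(path, payload):
    """Stand-in for a neuropose.io save helper (hypothetical)."""
    with open(path, "w") as fh:
        json.dump(payload, fh)

class DiskFullFile:
    """File-like object whose every write fails with ENOSPC."""
    def write(self, data):
        raise OSError(errno.ENOSPC, "No space left on device")
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False

def test_write_results_disk_full(monkeypatch, tmp_path):
    # Every open() during this test hands back the failing file object.
    monkeypatch.setattr(builtins, "open", lambda *a, **k: DiskFullFile())
    with pytest.raises(OSError) as excinfo:
        write_results(tmp_path / "results.json", {"ok": True})
    assert excinfo.value.errno == errno.ENOSPC
```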
Non-scope:
- Chaos-engineering frameworks. `monkeypatch` is enough.
- Covering unrecoverable errors like SIGKILL of the daemon mid-frame. That's the recovery-on-startup test, which already exists.
Track 2: Clinical platform (contingent)
Track 2 is everything beyond the open-source research tool — multi-tenancy, audit logging, HTTP/API layer, clinician UI, clinical-system integrations, the works. None of it is sequenced with Phases 0–2; all of it is gated on specific triggers that don't exist yet.
Triggers to activate Track 2
Do not start Track 2 work until at least one of the following is true:
- External demand. Another clinical group has asked for a deployment they can run independently. Not a casual "interesting project" — a specific ask with a specific cohort and a specific timeline.
- Multi-site ambition. The Shu Lab decides to run NeuroPose across more than one site within Brown-affiliated clinical systems, and the single-host assumption stops working.
- Funding mandate. A grant or contract specifies outputs that the Phase 0-1-2 deliverables cannot meet (e.g. "produce a HIPAA-compliant deployment," "integrate with the EHR").
- Publication traction. Papers A and C get engagement that translates into demand for a hosted version. Clinical-methods papers occasionally do. If enough unsolicited inquiries land, Track 2 becomes worth the investment.
Before at least one of these triggers: everything below is background thinking, not planned work. Do not refactor Phase 0 or Phase 2 code to make Track 2 easier. Every such refactor is a bet on a future that may not arrive.
Multi-tenancy and identity
What it would require:
- A concept of "user" distinct from "OS user." Today `Settings.data_dir` is one directory per OS user; multi-tenancy means one `data_dir` serving many logical tenants with enforced isolation.
- Per-tenant namespacing in `in/`, `out/`, `failed/`, and `status.json`. Cleanest is one subdirectory per tenant with the same four-directory layout; the daemon's discovery logic becomes a two-level scan.
- Authentication on the control plane. Passing tenant identity as a command-line arg is fine for a research prototype; a real deployment needs OAuth/OIDC or SAML with the institution's IdP (Brown CAS, Epic auth, whatever the target site uses).
- Authorization model: at minimum, "tenant A cannot see tenant B's jobs." For clinical deployments, probably also role-based (clinician / PI / admin / auditor).
Cheapest path forward if a trigger fires: fork the data-directory layout into `$data_dir/<tenant_id>/{in,out,failed,status.json}`, teach the daemon to iterate tenants in its poll loop, and add a `--tenant` flag to the CLI. That's enough for an invitation-only deployment where tenants are identified by an opaque string issued out-of-band.
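The two-level scan is a few lines. A sketch, where the function name and the assumption that job directories live directly under each tenant's `in/` are both illustrative:

```python
from pathlib import Path

def iter_tenant_jobs(data_dir: Path):
    """Yield (tenant_id, job_dir) for every pending job across tenants."""
    for tenant_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        in_dir = tenant_dir / "in"
        if not in_dir.is_dir():
            continue  # stray file or legacy single-tenant layout; skip it
        for job_dir in sorted(p for p in in_dir.iterdir() if p.is_dir()):
            yield tenant_dir.name, job_dir
```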
Expensive path: anything involving an identity provider. Don't go there without a real operator committing to the deployment.
Audit logging and compliance posture
What it would require:
- Append-only log of every data access, write, and configuration change, with actor identity and timestamp. Separate from the application log (which rotates).
- Logs streamed to a write-once sink (S3 with object-lock, immutable journal) so a compromised host can't rewrite the trail.
- Legal review: what exactly does HIPAA require of this tool? What about institutional IRB? The answer will differ across sites and the project cannot prescribe it — but the capability to generate the required logs needs to be built in.
- Retention policy wired to the audit log, not just application state. Pruning job results is different from pruning audit records.
Technical prerequisite: structured logging from Phase 2 (which is a cheap add and is scheduled anyway). Without JSON-per-line logs, audit extraction is a grep-and-pray regex problem.
HTTP/API layer
What it would require:
- Today the control plane is "write files to `in/`." For a non-filesystem-native consumer (a hosted web UI, a batch scheduler, a Jupyter kernel in a different container), an HTTP API is the right abstraction.
- FastAPI or Litestar on top of the existing ingest/interfacer/io modules. Either the daemon becomes a long-running process that serves requests and processes the input directory, or the daemon stays headless and the HTTP layer is a separate process talking via the same filesystem contract.
- OpenAPI schema published as part of the release so client code can be generated.
Non-obvious pitfall: the daemon's `fcntl`-based single-instance lock assumes one writer. If the HTTP layer is a separate process, it needs to go through the same ingest API, not write directly into `in/`. That's an easy discipline to establish if designed in from day one, and a painful refactor later.
Cheap Phase 0/2 precaution: keep `neuropose.ingest` and `neuropose.interfacer` API-stable as Python modules. If a future HTTP layer imports them, we don't want to break the import.
Clinician-facing UI
What it would require:
- More than the `neuropose serve` dashboard — an actual web application with clinician-facing views: patient list, session list, session-level pose visualization, comparison against reference motion, exportable reports.
- Probably React + TypeScript on the frontend, consuming the HTTP API above. Backend-rendered templates would be faster to build but a worse fit for the per-session interaction model clinicians expect.
- WebGL or Three.js for 3D pose playback. The `neuropose.visualize` module is a matplotlib-based still-frame tool; rebuilding it for interactive 3D is a weeks-to-months project on its own.
- Accessibility: clinician environments include keyboard-only users, users on institutional IE holdovers (yes, still), and users with screen readers. A research-grade UI ignores this; a clinical-grade one cannot.
Scope is enormous. This is the single largest piece of Track 2 and would likely dwarf all other Track 2 work combined. Would not start without dedicated frontend engineering effort.
Horizontal scaling
What it would require:
- A message broker (Redis Streams, RabbitMQ, or NATS) in place of the filesystem poll. Each job becomes a broker message; multiple worker processes consume and process in parallel.
- Shared storage for inputs and outputs (S3, MinIO, NFS). The "job_name is a directory" contract generalizes to "job_name is an object prefix."
- Per-worker GPU affinity for the multi-GPU case; worker auto-sizing based on queue depth.
- Distributed lock for the leader-only work (status file writes, retention enforcement).
Upgrade path that minimizes pain: the current single-process daemon is equivalent to the "one worker" case of a horizontal deployment. If the job object in `neuropose.io` stays the source of truth (not the filesystem layout), the transition is a backend swap, not architectural surgery. Keep that option open by treating the filesystem as an implementation detail of `Interfacer`, not a public contract.
Backup, replication, and data durability
What it would require:
- Outputs (`out/<job>/results.json`) currently live on one disk on one host. For clinical data this is insufficient durability.
- Replication target: another host (hot standby), object storage (warm archive), or both. The `out/` directory is the canonical store; replicating it periodically is a scriptable cron job today.
- Proper replication: as writes happen, not as a cron. Either a daemon-side hook that PUTs to S3 immediately after each `save_job_results`, or a sidecar process watching the filesystem with `inotify`/`fswatch`.
- Restore story: how do we restore `out/` from backup without breaking `status.json` (which refers to job names by convention)? Test this annually.
Minimum viable backup for Phase 2: add a `scripts/backup.sh` that rsyncs `$data_dir/out/` to a configurable destination. Not a feature; a paving-the-path-for-operators artifact.
Clinical-system integrations
What it would require:
- DICOM if videos are stored as DICOM instances rather than MP4. Clinical motion-analysis devices increasingly output DICOM video; reading DICOM means `pydicom` plus some decoding logic.
- FHIR for patient metadata. If NeuroPose is to accept a patient ID and attach it to a session, that ID probably comes from a FHIR Patient resource. That means speaking FHIR to the hospital's FHIR endpoint (Epic, Cerner).
- REDCap integration for clinical-research cohorts (the Brown ecosystem uses it heavily). An export script that pulls subject metadata from a REDCap project and lays it into the ingest directory is cheap and valuable.

Order of likely need: REDCap first (easy, valuable, Brown-local), then DICOM (depends on what the recording device outputs), then FHIR (only if we're pulling from an EHR, which we probably aren't for research).
Deterministic inference mode
What it would require:
- Phase 0's `Provenance` object already captures model SHA, TF version, NumPy version, and a seed field. The missing piece for strict reproducibility is forcing TensorFlow itself to behave deterministically — `tf.config.experimental.enable_op_determinism()` plus seeding all of `random`, `numpy.random`, and `tf.random`.
- A `deterministic: bool = False` setting on `Settings` that flips the above. Default off, because deterministic mode costs a meaningful fraction of throughput on GPUs and isn't free on CPUs either. Clinical deployments would turn it on; benchmark runs would turn it off.
- A `Provenance.deterministic` boolean field is already in the Phase 0 scope; this item closes the loop by actually making that boolean mean something.
Cheap Phase 2 precaution: wire the setting in Phase 2 even if we don't flip it on by default. Future Track 2 deployments can flip it without a code change.
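The whole switch is a handful of calls. A sketch of the helper a `deterministic` setting would gate (`enable_determinism` is a hypothetical name; the TF calls themselves are real API, and the import guard only exists so the sketch degrades gracefully without TF installed):

```python
import random

import numpy as np

def enable_determinism(seed: int) -> None:
    """Seed every RNG in play and force deterministic TF ops (hypothetical)."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import tensorflow as tf
    except ImportError:
        return  # sketch-only guard; the real pipeline always has TF
    tf.random.set_seed(seed)
    tf.config.experimental.enable_op_determinism()
```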
Observability and SLOs
What it would require:
- Prometheus metrics endpoint (separate port from the monitor, no auth needed on metrics, loopback or behind a scraper only).
- Counters: jobs_processed, jobs_failed, frames_processed, bytes_read, bytes_written. Histograms: per-frame latency, per-job latency, per-video latency. Gauges: queue depth, active job count.
- Tracing: OpenTelemetry instrumentation on job_process, detect_poses, save_job_results. Again, the interesting spans are the long ones, so trace-sampling at 100% is usually fine until throughput matters.
- Defined SLOs: "99% of jobs complete within 10× video duration," "95% of monitor requests return in under 100 ms," etc. SLO definitions go into a `docs/slos.md`; burn-rate alerting is the operational half.
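With `prometheus_client` the metrics above are a few declarations (metric names mirror the list and are suggestions; `serve_metrics` and its port are illustrative):

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

JOBS_PROCESSED = Counter("neuropose_jobs_processed_total", "Jobs completed")
JOBS_FAILED = Counter("neuropose_jobs_failed_total", "Jobs moved to failed/")
QUEUE_DEPTH = Gauge("neuropose_queue_depth", "Jobs waiting in in/")
JOB_SECONDS = Histogram("neuropose_job_seconds", "Wall-clock seconds per job")

def serve_metrics(port: int = 9108) -> None:
    # Separate port from the monitor; scrape-only, loopback or behind a scraper.
    start_http_server(port)
```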
Order-of-magnitude dependency: none of this is useful without Track 2 demand. A single-user research Mac doesn't have SLOs.
Supply-chain attestation and signed releases
What it would require:
- SBOM generation on every release (CycloneDX or SPDX format, attached to the GitHub release and published alongside the wheel).
- Signed releases: sigstore / cosign signatures on the wheel, the Docker images, and the source tarball. GitHub's OIDC + sigstore makes this a ten-line workflow once. For a clinical tool, a reviewer being able to verify "this wheel is the one GitHub Actions produced from this commit" is non-negotiable.
- Reproducible builds: same source → same wheel hash. Python wheels are usually reproducible with `SOURCE_DATE_EPOCH` set and `.pyc` exclusion; document the exact command.
- Provenance attestations (SLSA level 2 or 3) for the CI pipeline. GitHub's `actions/attest-build-provenance` action does this.
Cheapest Phase 2 precaution: wire sigstore signing into the release workflow when it's first built (see Phase 2 release workflow section). Free after the initial setup.
Deployment orchestration
What it would require:
- Kubernetes manifests (Helm chart, probably). Pod specs for the daemon, the monitor, the HTTP API. Separate deployments so they can scale independently.
- Terraform or Pulumi for the underlying infrastructure: GPU node pool, object storage, IAM, TLS termination. Site-dependent; Brown runs primarily on-prem with some AWS — the IaC would need to target both.
- Secrets management: Vault, AWS Secrets Manager, or K8s Secrets + External Secrets Operator. The monitor token, the broker credentials, and the object-storage keys all need to stop being env vars in a `.service` file.
Strong recommendation: do not write any of this until there is a specific deployment with specific operators. Generic K8s manifests written without a target are a solution in search of a problem, and they age fast.
Decisions to not prematurely foreclose
A short list of choices we should avoid making in Phase 0 or Phase 2 that would make Track 2 more expensive later:
- Keep `neuropose.ingest` and `neuropose.interfacer` API-stable as Python modules. A future HTTP layer should be able to import them. Avoid adding `@staticmethod` decorators that hide internal state; avoid coupling to global config.
- Keep the filesystem layout reversible. Anything in `$data_dir` that is not a user artifact should be treated as internal. If Track 2 wants to replace the filesystem with an object store, the daemon's only file I/O should be via `neuropose.io` helpers — no raw opens scattered through the code.
- Keep `VideoPredictions.provenance` extensible. The Phase 0 `Provenance` model should be a pydantic model so fields can be added backward-compatibly. Don't pack provenance into free-form strings or nested dicts that require bespoke parsing.
- Keep the CLI subcommands orthogonal. Do not add subcommands that wrap multiple other subcommands for convenience; that creates API shape we'd regret if the right composition layer later is HTTP, not shell.
- Keep model loading behind `neuropose._model`. A future self-hosted model registry, signed-artifact verification, or multi-model switching should be a change in one file, not a refactor across the estimator.
- Keep `Settings` the single source of truth. No `os.environ` reads outside pydantic-settings; no sprinkled `Path.home()` calls. Track 2 almost certainly overrides configuration from a secret store, and if that override has one place to hook in, it's easy.
- Keep the status-file schema owned by pydantic, not hand-written JSON. Track 2 multi-tenancy means indexing into the status file by tenant; a pydantic model refactor is cheap, a hand-written dict refactor is not.
- Keep the `AnalysisConfig` shape additive. The Phase 0 YAML schema will evolve through Phase 1 as Paper C's experiments surface needs. Additions are free (new optional fields); renames and removals invalidate prior experiments. Pydantic's `extra="forbid"` catches typos at parse time while still allowing additive extension.
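The additive-schema pattern is two lines of pydantic. A sketch (the field name is a placeholder; the real `AnalysisConfig` fields are Phase 0 scope):

```python
from pydantic import BaseModel, ConfigDict

class AnalysisConfig(BaseModel):
    model_config = ConfigDict(extra="forbid")  # unknown keys fail at parse time

    dtw_metric: str = "euclidean"
    # Adding a new *optional* field later is backward-compatible; renaming
    # or removing one invalidates configs written for prior experiments.
```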
These are cheap-now / expensive-later items. Every other Track 2 decision can wait for a real trigger.