From a4186582fad06c900f8a9bc9f659d2ad73112122 Mon Sep 17 00:00:00 2001 From: Levi Neuwirth Date: Thu, 23 Apr 2026 09:12:48 -0400 Subject: [PATCH] untrack lab-internal ideation docs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit RESEARCH.md and TECHNICAL.md are living R&D / engineering roadmap notes — pre-meeting drafts, speculative directions, and in-progress thinking that should evolve freely without public-repo concerns. Same for docs/research/, a new directory for pre-meeting scoping artifacts (e.g. the MoCap data-needs spec being drafted for the upcoming conversation with Dr. Shu). Files stay on disk in every checkout — the gitignore just stops them from entering the index. Anything that graduates to a user-facing artifact moves into docs/ (which is tracked and feeds mkdocs) rather than these files. --- .gitignore | 10 + RESEARCH.md | 784 --------------------------------- TECHNICAL.md | 1191 -------------------------------------------------- 3 files changed, 10 insertions(+), 1975 deletions(-) delete mode 100644 RESEARCH.md delete mode 100644 TECHNICAL.md diff --git a/.gitignore b/.gitignore index 0c58b27..216e017 100644 --- a/.gitignore +++ b/.gitignore @@ -70,6 +70,16 @@ Thumbs.db # --- Docs site build ------------------------------------------------------- site/ +# --- Ideation / lab-notebook docs ------------------------------------------ +# Living R&D notes and engineering roadmaps. Kept locally so they can +# evolve freely with in-progress thinking, pre-meeting drafts, and +# speculative directions that don't belong in the public repo. Anything +# under docs/research/ is treated the same way — personal / lab-internal +# working artifacts, not published docs. +/RESEARCH.md +/TECHNICAL.md +/docs/research/ + # --- Data and model weights (policy-enforced) ------------------------------ # Runtime job directories, subject data, and downloaded model caches must # never be committed. The default runtime location is under $XDG_DATA_HOME diff --git a/RESEARCH.md b/RESEARCH.md deleted file mode 100644 index c040c4c..0000000 --- a/RESEARCH.md +++ /dev/null @@ -1,784 +0,0 @@ -# NeuroPose Research and Ideation Notes - -A living R&D log for open design questions, speculative directions, and -planned experiments that are larger in scope than individual commits. -This is **not** user-facing documentation — items in here are -*candidates* for future work, and inclusion does not imply commitment. - -## How to use this document - -- Add a section when you start thinking about a new area of investigation. -- Each section should end with an **Open questions** or **Next steps** - block so it's obvious to a future you (or a new contributor) what the - active threads are. -- When something in here is decided and implemented, move it to the - relevant place in `docs/` or in the code itself and leave a short - pointer behind ("*See `docs/architecture.md` for the resolved design.*"). -- Consider the audience: yourself, Dr. Shu, David, Praneeth, and future - contributors. Assume they know pose estimation at a grad-student level - but may not have followed every prior conversation. 
- -## Contents - -- [DTW methodology](#dtw-methodology) -- [TensorFlow version compatibility](#tensorflow-version-compatibility) -- [MeTRAbs hosting and extensibility](#metrabs-hosting-and-extensibility) - ---- - -## DTW methodology - -### Current implementation (v0.1, commit 10) - -`neuropose.analyzer.dtw` ships three entry points, all built on top of -[`fastdtw`](https://github.com/slaypni/fastdtw) with -`scipy.spatial.distance.euclidean` as the point-distance function: - -- **`dtw_all(a, b)`** — single DTW on flattened `(frames, joints × 3)` - vectors. One scalar distance for the whole sequence. -- **`dtw_per_joint(a, b)`** — one DTW call per joint, returning a list - of per-joint distances and warping paths. Preserves per-joint - temporal alignment at J× the cost. -- **`dtw_relation(a, b, joint_i, joint_j)`** — DTW on the per-frame - displacement vector between two specific joints. The intent here is - to capture "how does the relationship between these two joints change - over time", which is translation-invariant and so immune to raw - camera-frame changes. - -These three correspond directly to the three helpers that existed -(broken) in the previous prototype's `analyzer.py`, ported forward with -bug fixes, types, and tests. **The port was mechanical — not a -methodological choice.** We inherited the FastDTW + Euclidean defaults -without validating them against the clinical research use cases, and -that validation is overdue. - -### Known limitations of the v0.1 approach - -#### FastDTW is an approximation, not exact DTW - -[FastDTW](https://cs.fit.edu/~pkc/papers/tdm04.pdf) is a multi-scale -approximation that runs in linear time by recursively refining a coarse -alignment. For the radius-based implementation in -`slaypni/fastdtw`, the distance is not guaranteed to match exact DTW, -and in pathological cases the error can be significant. For a research -codebase where the DTW distance is going to show up in a figure, that -matters. - -**Candidate exact alternatives** (all pip-installable): - -- [`dtaidistance`](https://github.com/wannesm/dtaidistance) — C-based, - supports both exact DTW and a `fast=True` approximation; also - supports shape-DTW and various constraint bands. Actively maintained, - and the underlying algorithms match the textbook. -- [`tslearn`](https://tslearn.readthedocs.io/) — ML-flavored toolkit - with exact DTW, soft-DTW (differentiable), Sakoe-Chiba banding, and - kernel-DTW. Good fit if we ever want to feed DTW distances into an - sklearn/PyTorch pipeline. -- [`cdtw`](https://github.com/statefb/dtw-python) / `dtw-python` — - Python port of the R `dtw` package; exhaustive options for windowing, - step patterns, and open-ended alignment. Less friendly API but the - most rigorously documented. - -#### Euclidean is a choice, not a default - -Treating `(x, y, z)` joint positions as a point in R³ and taking -Euclidean distances implicitly assumes the three axes are commensurable -in the same units, which is fine for MeTRAbs (mm) but throws away prior -knowledge about human motion. Alternatives worth considering: - -- **Angular distance on joint angles.** Compute joint angles per frame - (`extract_joint_angles` already exists) and run DTW on the angle - sequences rather than raw coordinates. Translation- and - scale-invariant by construction; well-matched to clinical metrics - like knee flexion angle. -- **Geodesic distance on SO(3)** for local joint rotations. 
Requires a - skeleton-rooted rotation parameterization; more work to set up but - the right metric for "how different are these two poses?" in a - biomechanics sense. -- **Mahalanobis distance** against a learned pose prior. This is the - "machine learning" answer — fit a covariance to a reference corpus - (normal gait from a healthy cohort), then measure distances in the - whitened space. Requires enough data to fit the prior without - overfitting, but makes "is this gait abnormal?" a calibrated question. - -#### Preprocessing: what invariance do we want? - -The v0.1 implementation is not invariant to anything. Two videos of the -same subject with a different camera position will give a different -DTW distance, which is almost certainly not what a clinician wants. -Candidate preprocessing steps: - -- **Translation invariance**: subtract the root joint (pelvis or torso - centroid) from every joint per frame, so all poses are expressed in a - body-relative coordinate frame. Cheap and almost always desired. -- **Scale invariance**: divide by a reference length (e.g., torso - length, or total skeleton span) so tall and short subjects produce - comparable distances. Important for comparing across subjects. -- **Rotation invariance**: align to a canonical frame (e.g., hip-to-hip - vector = x-axis, hip-to-shoulder = z-axis) per frame. Required if the - subject's orientation relative to the camera varies between trials. -- **Procrustes alignment per frame**: fit the best rigid transform - (rotation + translation) between pose A's frame and pose B's frame - before computing distance. The closed-form - [Kabsch algorithm](https://en.wikipedia.org/wiki/Kabsch_algorithm) is - fast and exact. This is likely the *right* thing for most comparison - use cases but has never been wired up. - -The `dtw_relation` helper is translation- and (for unit-vector -displacements) scale-invariant by construction, which is why it ends up -being the most useful of the three existing entry points in practice. - -#### Representation: coordinates, angles, velocities, or dual? - -The v0.1 DTW operates on **3D joint coordinates** (translation-dependent) -or **joint-pair displacements** (`dtw_relation`). Other representations -worth comparing: - -- **Joint angles.** Using `extract_joint_angles` output as the DTW - input gives a rotation-and-translation-invariant comparison that's - also directly interpretable in clinical terms. -- **Joint velocities.** Temporal derivatives of position. Emphasizes - *how the pose changes* rather than *what it is* — good for - discriminating smooth from jerky motion in gait. -- **Dual (position + angle).** Concatenate normalized position and - angle features into a single per-frame vector. More expressive but - requires tuning the relative weights. -- **Learned embeddings.** Feed each frame through a pretrained - pose-representation network (there are a few) and DTW on the - embedding space. Expensive and opaque but may capture - higher-order structure. - -#### Multi-scale approaches - -FastDTW is already multi-scale internally. Other ideas: - -- **Coarse-to-fine DTW.** Downsample aggressively, run exact DTW on - the coarse version to get a sub-quadratic alignment, then refine - locally. This is essentially what FastDTW does, but with an explicit - signal-processing hat on. -- **Wavelet-decomposed DTW.** Decompose each joint's trajectory into - wavelet coefficients and run DTW on the low-frequency coefficients. 
- Unclear whether this actually helps; interesting because it separates - posture (low-frequency) from tremor / micro-motion (high-frequency). - -#### Clinical gait: cycle-aware DTW - -Gait is approximately periodic, and "the 4th heel-strike of trial A" -is the clinically meaningful comparison point to "the 4th heel-strike -of trial B", not "frame 120 of A vs frame 120 of B". A natural two-stage -approach: - -1. **Cycle detection.** Find heel-strikes (or other gait events) via - peak detection on a joint's vertical coordinate, and segment each - trial into individual cycles. -2. **Per-cycle DTW.** Time-warp within each cycle independently to - normalize cycle duration. The distance between trials is then the - sum / mean of per-cycle distances. - -This is standard in the biomechanics literature -([Sadeghi et al. 2000](https://doi.org/10.1016/S0966-6362(00)00074-3) -and descendants) and is almost certainly a better fit for clinical -comparison than the naive full-trial DTW we ship today. - -#### Soft-DTW for learning applications - -[Soft-DTW](https://arxiv.org/abs/1703.01541) is a differentiable -relaxation of DTW, which means gradients can flow through it. This -matters if we ever want to train a network to *learn* a distance -metric or an embedding under a DTW objective — for example, a pose -encoder whose output space is calibrated to gait similarity. Worth -keeping on the radar even if we're not training anything today. -`tslearn` implements it. - -### Evaluation strategy - -Validating a DTW implementation is harder than validating most things. -Some ideas for how to know we got it right: - -- **Synthetic perturbations.** Take a reference sequence and apply - known perturbations (time stretch, added noise, spatial offset) and - verify that distance scales monotonically with perturbation magnitude - and that invariance properties are honored. -- **Reference implementation parity.** For a small set of hand-picked - pairs, compute DTW distance using `dtaidistance` exact DTW and - our implementation, and verify the approximation error is below a - documented threshold. -- **Inter-rater clinical benchmark.** When we have labeled clinical - data, measure how well DTW distance correlates with clinician - ratings of gait similarity. This is the real test but is gated on - having data we can use. -- **Pathology discrimination.** Can DTW distance separate healthy - from impaired gait in a held-out set? This is the usefulness test. - -### Open questions - -1. Is FastDTW good enough, or should we move to `dtaidistance` exact - DTW as the default? (First concrete experiment: pick 20 pairs from - whatever reference data we can source, compute distance both ways, - see if the approximation error is acceptable.) -2. What's the right representation for clinical gait DTW — raw - coordinates, joint angles, or per-pair displacements? -3. Should we implement Procrustes alignment as a preprocessing step - before any DTW call? (If yes, it belongs in `neuropose.analyzer.features`.) -4. Should the clinical pipeline use cycle-segmented DTW instead of - full-trial DTW? This is a methodological choice with real - downstream implications. -5. Is soft-DTW useful to us, or is it a solution looking for a - problem we don't have? -6. What reference corpus do we use to develop and validate any of this? - -### Reading list - -- Sakoe, H. & Chiba, S. (1978). "Dynamic programming algorithm - optimization for spoken word recognition." The original DTW paper. -- Salvador, S. & Chan, P. (2007). 
"Toward accurate dynamic time - warping in linear time and space." - [PDF](https://cs.fit.edu/~pkc/papers/tdm04.pdf). The FastDTW paper. -- Cuturi, M. & Blondel, M. (2017). "Soft-DTW: a Differentiable Loss - Function for Time-Series." [arXiv 1703.01541](https://arxiv.org/abs/1703.01541). -- Sadeghi, H. et al. (2000). "Symmetry and limb dominance in able-bodied - gait: a review." Biomechanics reference for cycle-aware analysis. -- `dtaidistance` documentation — - . Worth reading even if we - don't switch, for the overview of DTW variants and constraints. - -### Next steps - -- [ ] Pick 10–20 reference pose-sequence pairs and run both FastDTW and - exact DTW on them to quantify the approximation error. -- [ ] Prototype a Procrustes-aligned preprocessing wrapper and - re-run the same pairs. -- [ ] Sketch a cycle-aware DTW pipeline against a gait dataset we can - actually use (identity- and IRB-safe). -- [ ] Decide whether to keep FastDTW as the default or replace it. -- [ ] If we replace it: migrate `neuropose.analyzer.dtw` to the new - backend in a single commit with no API change. - ---- - -## TensorFlow version compatibility - -### The question - -The pinned MeTRAbs model artifact -(`metrabs_eff2l_y4_384px_800k_28ds.tar.gz`) is a TensorFlow SavedModel. -SavedModels embed a producer TF version and depend on a set of TF op -kernels. Picking a TF version pin that is too low risks Apple Silicon -install pain (pre-2.16 has no native `darwin/arm64` wheel under the -`tensorflow` package name); picking one that is too high risks loading -or runtime failures if MeTRAbs uses ops that have been renamed, -deprecated, or removed. The goal of this investigation was to find the -**minimum** pin that works on Linux x86_64, Linux arm64, and macOS arm64 -without forcing platform-conditional dependencies or shipping -`tensorflow-metal` as a default. - -### Method - -Phase 0 of the procedure laid out earlier in this document was to -inspect the SavedModel directly and run `detect_poses` end-to-end on a -synthetic input. The probe script (`test.py` at the repo root, kept -during the investigation and removed in the same commit that landed the -pin) did three things: - -1. Parsed `saved_model.pb` with `saved_model_pb2.SavedModel` and read - the `tensorflow_version` and `tensorflow_git_version` fields out of - each `meta_info_def` to establish the **producer** version. -2. Walked every `node.op` and `library.function[*].node_def[*].op` in - the graph to enumerate the **complete set of ops** the model relies - on. This is the binary-compatibility surface — anything in this set - that gets removed in a future TF release breaks the model. -3. Called `tf.saved_model.load(MODEL_DIR)`, accessed - `per_skeleton_joint_names["berkeley_mhad_43"]`, and invoked - `model.detect_poses(image, intrinsic_matrix=..., skeleton="berkeley_mhad_43")` - on a 288×384 black frame to confirm the consumer TF version actually - *runs* the model (not just loads it — these are different failure - modes). - -The probe ran on Linux x86_64 against whatever `uv sync --group dev` -resolved at the time, which was **TensorFlow 2.21.0** with **Keras -3.14.0** — i.e. the most recent TF release as of 2026-04 and a version -that crosses the Keras-3 cutover at TF 2.16. - -### Result - -- **Producer version:** `tf version: 2.10.0`, - `producer: v2.10.0-0-g359c3cdfc5f`. The model was serialized in - September 2022, consistent with the file mtimes in the extracted - tarball. -- **Custom ops:** **zero**. 
`tf.raw_ops.__dict__` filtered for - `"metrabs"` returned `[]`. Every op in the SavedModel is a stock - TensorFlow kernel that has been stable since at least TF 2.4. -- **Op inventory** (recorded for posterity so a future contributor can - diff against a newer MeTRAbs release without re-running the probe): - - ``` - Abs, Add, AddV2, All, Any, Assert, AssignVariableOp, AvgPool, - BatchMatMulV2, BiasAdd, Bitcast, BroadcastArgs, BroadcastTo, Cast, - Ceil, Cholesky, CombinedNonMaxSuppression, ConcatV2, Const, Conv2D, - Cos, Cross, Cumsum, DepthwiseConv2dNative, Einsum, EnsureShape, Equal, - Exp, ExpandDims, Fill, Floor, FloorDiv, FloorMod, FusedBatchNormV3, - GatherV2, Greater, GreaterEqual, Identity, IdentityN, If, - ImageProjectiveTransformV3, LeakyRelu, Less, LessEqual, Log, - LogicalAnd, LogicalNot, LogicalOr, LookupTableExportV2, - LookupTableFindV2, LookupTableImportV2, MatMul, MatrixDiagV3, - MatrixInverse, MatrixSolveLs, MatrixTriangularSolve, Max, MaxPool, - Maximum, Mean, MergeV2Checkpoints, Min, Minimum, Mul, - MutableDenseHashTableV2, Neg, NoOp, NonMaxSuppressionWithOverlaps, - NotEqual, Pack, Pad, PadV2, PartitionedCall, Placeholder, Pow, Prod, - RaggedRange, RaggedTensorFromVariant, RaggedTensorToTensor, - RaggedTensorToVariant, Range, Rank, ReadVariableOp, RealDiv, Relu, - Reshape, ResizeArea, ResizeBilinear, RestoreV2, ReverseV2, - RngReadAndSkip, SaveV2, Select, SelectV2, Shape, ShardedFilename, - Sigmoid, Sin, Size, Slice, Softplus, Split, SplitV, Sqrt, Square, - Squeeze, StatefulPartitionedCall, StatelessIf, - StatelessRandomUniformV2, StatelessWhile, StaticRegexFullMatch, - StridedSlice, StringJoin, Sub, Sum, Tan, Tanh, TensorListConcatV2, - TensorListFromTensor, TensorListGetItem, TensorListReserve, - TensorListSetItem, TensorListStack, Tile, TopKV2, Transpose, Unpack, - VarHandleOp, Where, While, ZerosLike - ``` - -- **Load:** `tf.saved_model.load` returned a `_UserObject` with - `detect_poses` exposed. No warnings about deprecated kernels, no - errors. The 11-minor-version forward jump from producer 2.10 to - consumer 2.21 was a non-event, including the Keras 3 cutover at 2.16. -- **Skeleton check:** `per_skeleton_joint_names["berkeley_mhad_43"]` had - shape `(43,)` and `per_skeleton_joint_edges["berkeley_mhad_43"]` had - shape `(42, 2)`, exactly matching what - `tests/integration/test_estimator_smoke.py` asserts. -- **End-to-end inference:** `model.detect_poses` on a black 288×384 - frame returned `{'poses3d': (0, 43, 3), 'boxes': (0, 5), - 'poses2d': (0, 43, 2)}`, all `float32`. Zero detections is the - correct output for a black frame — the important signal is that the - shapes, dtypes, and key names exactly match what `FramePrediction` in - `neuropose.io` is built to ingest, so the entire estimator pipeline - is wire-compatible with this TF version. - -### Decision - -Pin `tensorflow>=2.16,<2.19`. Reasoning: - -1. **2.16 is the Apple Silicon floor that matters.** TF 2.16 is the - first release with native `darwin/arm64` wheels published on PyPI - under the `tensorflow` package name. Below 2.16, Mac users would - need `tensorflow-macos` (a separate Apple-maintained package), which - forces ugly platform markers in `pyproject.toml` and means Linux and - Mac users run subtly different codebases. Above 2.16, the same - single dependency line installs cleanly on every supported platform. -2. **MeTRAbs imposes no upper bound below 3.0.** Producer 2.10 → consumer - 2.21 (an 11-minor-version jump across the Keras 3 boundary) loaded - and ran without a single complaint. 
The op inventory is 100% stock, - so future TF 2.x releases would only break this if they removed - stable kernels — which would itself be a TF 2.x SemVer violation. -3. **`tensorflow-metal` is an opt-in extra, not a default.** - `tensorflow-metal` is a PluggableDevice that Apple ships separately - to add a Metal-backed `/GPU:0`. It has its own version-compatibility - table (Apple maintains it at - `developer.apple.com/metal/tensorflow-plugin/`), has a documented - history of producing silently-wrong numerics on specific TF ops, - and breaks intermittently on Keras 3. For a clinical-research - pipeline where reproducibility matters more than inference latency, - CPU inference on Mac is the right default. We do ship a - `[project.optional-dependencies].metal` extra that pulls - `tensorflow-metal>=1.2,<2` under darwin/arm64 platform markers, so - users who want the speedup can opt in via - `pip install 'neuropose[metal]'` — but the Metal path is not - exercised in CI, is documented as experimental in - `docs/getting-started.md`, and users are expected to spot-check - `poses3d` output against the CPU path before trusting it for any - clinical measurement. -4. **`tensorflow-metal` forces a TF upper bound.** `tensorflow-metal` - 1.2.0 (released January 2025, the latest version as of 2026-04) is - advertised as supporting "TF 2.18+" but in practice fails on - 2.19 and 2.20 with symbol-not-found errors and graph-execution - `InvalidArgumentError`s. See - [tensorflow/tensorflow#84167](https://github.com/tensorflow/tensorflow/issues/84167) - and the Apple Developer forum threads at - [developer.apple.com/forums/thread/772147](https://developer.apple.com/forums/thread/772147) - and [developer.apple.com/forums/thread/803658](https://developer.apple.com/forums/thread/803658). - 2.18.x is the last version confirmed to work cleanly on Apple - Silicon GPU. Even though the Metal path is opt-in, dependency - resolution is shared — if uv resolves `tensorflow` to 2.21 on a - Linux developer's machine and 2.18 on the Mac, lockfile churn - and "works on my box" become permanent. Cap is therefore applied - globally rather than via a darwin/arm64 marker split. Cost on - Linux is zero: nothing in the pipeline depends on TF 2.19+ - features, and the SavedModel ran fine on TF 2.21 in the probe - above, so the cap is purely an external-package constraint. Lift - it once Apple ships a Metal plugin that tracks mainline - TensorFlow again. - -### What is **not** yet verified - -- The probe ran on Linux x86_64 only. macOS arm64 has not been exercised - on real hardware. The argument that it should work is by construction - — `tensorflow==2.16+` ships native arm64 macOS wheels, the SavedModel - uses zero custom ops, and there is no MeTRAbs-side platform code — but - empirical confirmation is still pending. -- Linux arm64 has likewise not been exercised. Same by-construction - argument applies. -- A `macos-14` GitHub Actions matrix entry (which would run the unit - tests on Apple Silicon hardware) is the cheapest way to catch any - regression and is the intended follow-up. -- Inference-output numerics have not been compared across platforms. - This is the next layer of rigor below "does it run" — we expect - fp32 results to match within ~1e-3 mm on `poses3d`, but a real - cross-platform diff against a reference set has not been done. -- The `[metal]` optional-dependencies extra exists in `pyproject.toml` - but the Metal code path has never been exercised against the - pinned MeTRAbs SavedModel. 
Enabling it is a pure opt-in and comes - with a documented "verify your own numerics" caveat in - `docs/getting-started.md`. Whether it actually produces a speedup - on EfficientNetV2-L-based inference on real clinical videos — - and whether that speedup is worth the numerical-divergence risk - — is unknown. - -### Open questions - -1. Does the same `detect_poses` call produce numerically equivalent - `poses3d` on macOS arm64 as on Linux x86_64 against a real (non-black) - reference image? Within what tolerance? -2. If a future MeTRAbs release introduces a custom op (e.g. for a new - detector head), how do we want the loader to fail? Currently the - `_REQUIRED_MODEL_ATTRS` interface check would still pass; the failure - would surface at first `detect_poses` call, which is late. -3. ~~Does it make sense to upper-bound the pin more tightly than `<3.0` - (e.g. `<2.22` to bound to tested versions), or is the SemVer guard - sufficient given the all-stock-ops result?~~ **Resolved 2026-04-16.** - Tightened to `<2.19` for `tensorflow-metal` compatibility. See - reasoning point 4 in the Decision section above. - -### Next steps - -- [ ] Run the same probe on real macOS arm64 hardware and log the - result (load success, detect_poses success, output numerics - diff against the Linux baseline). -- [ ] Add a `macos-14` matrix entry to `.github/workflows/ci.yml` for - the unit tests. Slow tests stay Linux-only to avoid doubling the - MeTRAbs download cost in CI. -- [ ] Re-run the probe whenever MeTRAbs upstream publishes a new model - tarball, and diff the op inventory above. Any new op that is not - in the list above is a flag worth investigating before raising - the pin. -- [ ] Benchmark `[metal]` vs CPU on a real Apple Silicon Mac against - a short reference clip: measure (a) per-frame latency, (b) peak - memory, and (c) `poses3d` divergence from the CPU baseline. If - the speedup is meaningful and the numerics are within - ~1e-2 mm, move the `metal` extra from "experimental" to - "supported" in the docs. If not, document the failure mode - here and keep the extra where it is. - ---- - -## MeTRAbs hosting and extensibility - -### Current state (v0.1, commit 11) - -The model loader in `neuropose._model.load_metrabs_model` will pin the -canonical upstream URL: - -``` -https://omnomnom.vision.rwth-aachen.de/data/metrabs/metrabs_eff2l_y4_384px_800k_28ds.tar.gz -``` - -This is the RWTH Aachen "omnomnom" host — a raw HTTP file server run -by the MeTRAbs authors' lab. There is no current HuggingFace mirror -of the relevant MeTRAbs variant at the time of commit 11. - -The URL encodes the model configuration: -`metrabs_eff2l_y4_384px_800k_28ds` means the EfficientNetV2-L backbone, -YOLOv4 detector head, 384-pixel input, 800k training steps, trained on -28 datasets. This name pattern is worth preserving when we host the -model ourselves so future variants stay self-describing. - -### Supply-chain concerns - -Pinning a single upstream URL to a third-party academic host is a -real supply-chain risk, and the audit of the previous prototype called -it out explicitly (the old code used `bit.ly/metrabs_1`, which was -even worse). Concrete failure modes: - -- The RWTH Aachen host goes down or is decommissioned. -- The URL changes when Sárándi releases a new MeTRAbs version. -- The tarball contents change under the same URL without a version bump. - -**Minimum mitigation** (should land in or immediately after commit 11): - -- **Pin a SHA-256 checksum** alongside the URL, and verify on download - before unpacking. 
If the checksum doesn't match, fail hard with a - clear error. -- **Cache aggressively.** Once downloaded and verified, never hit the - network again for the same configuration. `model_cache_dir` is - already in `Settings`. -- **Document the exact filename and checksum** in `RESEARCH.md` (or - migrate to a `MODEL_ARTIFACTS.md` file) so operators have a way to - manually download the model out-of-band if the primary URL is dead. - -### Self-hosting options - -We want to host the model ourselves, both for reliability and because -it opens the door to future fine-tuning and redistribution of our own -variants. Candidate hosting approaches: - -#### Forgejo LFS - -Pros: -- Lives next to the code. -- Version-controlled artifacts. -- Access control mirrors repo access. - -Cons: -- LFS is designed for git-tracked binary assets, not for large - infrequently-updated model weights — you pay LFS overhead on every - clone unless you configure `lfs.fetchexclude`. -- Model is ~2.2 GB; Forgejo LFS performance at that size is untested - for our instance. -- Pinning is by LFS pointer, which means the model is coupled to a - particular repo revision. Messy if we want multiple code revisions - to share the same model. - -**Verdict:** Workable but not the best fit. - -#### Forgejo generic package registry - -Forgejo supports a [generic package -registry](https://forgejo.org/docs/latest/user/packages/generic/) that -can host arbitrary binary artifacts with versioned URLs. This is -closer to what we want: - -``` -https://git.levineuwirth.org/api/packages/neuwirth/generic/metrabs/eff2l_y4_384px_800k_28ds/metrabs.tar.gz -``` - -Pros: -- Versioned URLs decoupled from repo revisions. -- Upload once, download many times, no clone coupling. -- Integrated auth if we want to gate access. -- Can be made public even if the repo is private. - -Cons: -- Requires uploading the file manually or via an API call. -- Forgejo registry size / bandwidth limits depend on the instance. - -**Verdict:** Probably the best fit for "we want it hosted alongside -the project." - -#### Plain HTTP server on a VPS subdomain - -A dedicated subdomain like `models.levineuwirth.org` backed by a -simple HTTP file server (nginx `autoindex`, or Caddy with a tidy -directory layout). Example URL: - -``` -https://models.levineuwirth.org/metrabs/metrabs_eff2l_y4_384px_800k_28ds.tar.gz -``` - -Pros: -- Simplest possible story. No API, no auth machinery. -- Easy to mirror from — anyone can curl the URL. -- Decoupled from the git forge, so we can share models publicly even - when the repo itself is private. -- Easy to put a CDN in front (Cloudflare) if bandwidth ever matters. - -Cons: -- Manual upload via scp/rsync. -- No access control unless we add it. -- No versioning beyond filename convention. - -**Verdict:** Strong candidate. This is probably the right choice for -v0.1 of self-hosted models. - -#### S3-compatible object storage (MinIO self-hosted) - -Run MinIO on the VPS, get S3-compatible API for free, and serve models -via pre-signed URLs or a public bucket. - -Pros: -- Proper object storage with ETags, range requests, multipart uploads. -- Integration story is straightforward if we ever move to cloud-hosted - storage. -- Industry-standard API. - -Cons: -- More operational complexity than a plain HTTP server for what might - be a handful of files. - -**Verdict:** Overkill for v0.1 but worth revisiting if model storage -becomes a real operational concern. 
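-
-One practical note that cuts across the options above: the upload
-story for the Forgejo generic registry is a single authenticated
-`PUT` per artifact against the generic packages API (documented at
-the link above). A sketch (the user, token handling, and URL mirror
-the example URL above and are illustrative):
-
-```
-curl --user "levi:$FORGEJO_TOKEN" \
-     --upload-file metrabs_eff2l_y4_384px_800k_28ds.tar.gz \
-     "https://git.levineuwirth.org/api/packages/neuwirth/generic/metrabs/eff2l_y4_384px_800k_28ds/metrabs.tar.gz"
-```
-
-A plain `GET` of the same URL downloads the artifact back, which is
-what makes the registry interchangeable with the plain-HTTP option
-from the model loader's point of view.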
- -### Integrity: SHA-256 pinning - -Regardless of which hosting approach we pick, **the model loader should -always verify a SHA-256 checksum** before trusting the downloaded -artifact. This is the one piece of supply-chain hygiene that has to be -in place before we ship commit 11 to any user outside the Shu lab. - -Implementation sketch for `neuropose/_model.py`: - -```python -def load_metrabs_model(cache_dir: Path | None = None) -> Any: - cache_dir = cache_dir or _default_model_cache_dir() - cache_dir.mkdir(parents=True, exist_ok=True) - tarball = cache_dir / _MODEL_FILENAME - if not tarball.exists(): - _download(_MODEL_URL, tarball) - _verify_sha256(tarball, _MODEL_SHA256) - extracted = _extract_if_needed(tarball, cache_dir) - return tfhub.load(str(extracted)) # or tf.saved_model.load -``` - -The `_MODEL_SHA256` constant is the source of truth; if it ever has -to change, the constant change is visible in the git diff and a human -reviews it. - -### Fine-tuning - -The next research direction after we have inference working is -fine-tuning MeTRAbs on clinical-specific data. Open questions: - -- **What data?** Any clinical data is IRB-gated. Even de-identified - pose data may carry subject information if the recording conditions - (lighting, room layout) are distinctive enough. Any training plan - has to run through the data-handling policy that lives (will live) - in `docs/data-policy.md`. -- **Transfer learning strategy.** - - *Head-only fine-tuning*: freeze the EfficientNetV2-L backbone and - re-train the pose regression head on clinical data. Fast, low - compute, unlikely to overfit, but also unlikely to capture - clinical-pose idiosyncrasies. - - *Low-LR full fine-tune*: unfreeze everything, use a learning rate - 1/100th of the original, train for a few epochs. Better - adaptation, higher risk of catastrophic forgetting. - - *Adapter layers*: insert small trainable adapters into the frozen - backbone. Parameter-efficient, well-studied in NLP, less common - for pose but should work. -- **Compute requirements.** EfficientNetV2-L is roughly 120M parameters; - fine-tuning on a single modern GPU (24 GB VRAM) is feasible at - reduced batch size. A multi-GPU node is friendlier but not strictly - required. -- **Evaluation.** We need held-out clinical data with trusted ground - truth. MoCap-derived poses are the gold standard; marker-based MoCap - systems provide sub-millimeter accuracy at the cost of subject - instrumentation. The Shu lab's access to MoCap is the gating factor. -- **Sharing fine-tuned weights.** If we fine-tune on clinical data, the - resulting weights may encode subject information in ways that are - non-obvious and potentially IRB-relevant. Sharing fine-tuned weights - externally has to be cleared through the same channels as sharing the - training data. - -### Training our own pose estimator - -The long-range version of the research direction: train a pose -estimator from scratch that extends MeTRAbs's methodology. MeTRAbs is -a good starting point because the method is well-documented: - -- Sárándi, I., et al. (2020). "MeTRAbs: Metric-Scale Truncation-Robust - Heatmaps for Absolute 3D Human Pose Estimation." - [arXiv 2007.07227](https://arxiv.org/abs/2007.07227), - IEEE Transactions on Biometrics, Behavior, and Identity Science. 
- -Core contributions (worth knowing if you modify any of this): - -- **Truncation-robust heatmaps.** Instead of predicting a 2D heatmap - bounded by the image, MeTRAbs predicts a heatmap that extends - *outside* the image and can place a joint at coordinates the image - alone could not disambiguate. Critical for crops where the subject - is partially out of frame. -- **Metric scale regression.** MeTRAbs predicts the absolute 3D - positions of joints in millimetres by combining a 2D heatmap with a - per-joint depth regressor. Most 3D pose methods produce only - relative coordinates, which are useless for clinical measurement. -- **Multi-dataset training with a common skeleton.** The 28-dataset - training set unifies disparate skeleton topologies into a common - 43-joint Berkeley MHAD skeleton, which we carry forward in - NeuroPose. - -**Natural extensions worth considering:** - -- **Temporal smoothing head.** MeTRAbs is a per-frame model. Clinical - gait analysis wants temporally smooth trajectories. Adding a - lightweight temporal head (1D CNN or small transformer over frame - sequences) could produce smoother outputs without touching the - backbone. -- **Clinical-specific heatmap supervision.** If we have MoCap data for - clinical poses, we can use it as ground-truth heatmap supervision to - improve accuracy in the pose ranges the model sees least often in - the 28-dataset training corpus (e.g., pathological gaits, walker- - assisted ambulation). -- **Multi-person identity tracking.** MeTRAbs produces detections per - frame without continuity across frames. Adding a Hungarian-matched - tracker (or a learned tracker) would solve the multi-person - identity problem that `predictions_to_numpy` currently dodges with - a `person_index` parameter. -- **Alternative backbones.** EfficientNetV2-L is a 2020-era choice. - Newer backbones (ConvNeXt, DINOv2-initialized ViTs) may give - meaningful gains, especially for clinical poses that are - under-represented in the original training set. -- **Uncertainty estimation.** Clinical users want to know when the - model is unsure. A Gaussian output head (mean + variance per joint) - or an ensemble-based approach would let us propagate uncertainty - into downstream analysis. - -**Compute requirements:** training MeTRAbs from scratch was reported -as "a few weeks" on 8x V100 in the original paper. A from-scratch -re-training is a substantial undertaking. Fine-tuning is much more -accessible. - -### Collaboration opportunities - -- **István Sárándi** (now at University of Tübingen, formerly RWTH - Aachen) is the author of MeTRAbs. The code is MIT-licensed and he - has historically been responsive to collaboration requests. If we - end up publishing work that significantly extends MeTRAbs, at the - very least we should reach out about co-authorship or - acknowledgment; at best we might find an active collaborator. -- **The Shu Lab's existing collaborators** on clinical gait research - at Brown and partner institutions may have MoCap-validated datasets - we can use for fine-tuning and evaluation. Worth asking Dr. Shu. - -### Open questions - -1. Does Forgejo's generic package registry actually handle a 2.2 GB - upload cleanly, or do we need the plain HTTP server route? -2. What's the right SHA-256 pin to commit alongside the URL? (Need to - download the tarball first and run `sha256sum`.) -3. Do we have access to MoCap-validated clinical gait data for - fine-tuning evaluation? This gates every training-related - experiment. -4. 
Is fine-tuning even worth pursuing before we have inference results
-   that are clearly *not* good enough on clinical data? (I.e.,
-   motivate the work with concrete failure cases rather than assuming
-   a delta we haven't measured.)
-5. Does it make sense to reach out to Sárándi now, or wait until we
-   have something concrete to collaborate on?
-
-### Reading list
-
-- Sárándi, I. et al. (2020). "MeTRAbs: Metric-Scale Truncation-Robust
-  Heatmaps for Absolute 3D Human Pose Estimation."
-  [arXiv 2007.07227](https://arxiv.org/abs/2007.07227). **Essential
-  reading** for anyone planning to extend the method.
-- Sárándi's personal site and the MeTRAbs GitHub repo
-  (<https://github.com/isarandi/metrabs>) — the code, model zoo, and
-  training scripts live here.
-- Zheng, C. et al. (2023). "Deep Learning-Based Human Pose Estimation: A
-  Survey." Good survey paper for orienting on the state of the art.
-- The original 28-dataset training composition referenced in the
-  MeTRAbs paper — worth tracing through to understand what poses are
-  in- and out-of-distribution for the pretrained model.
-
-### Next steps
-
-- [ ] Download the pinned tarball and compute its SHA-256 for the
-  commit-11 model loader.
-- [ ] Decide between Forgejo generic registry and plain HTTP subdomain
-  for self-hosting. Prototype whichever one wins.
-- [ ] Mirror the pinned tarball to the chosen self-hosted location so
-  we can fail over to it the moment the RWTH URL changes or goes
-  down.
-- [ ] Write a one-page "MODEL_ARTIFACTS.md" that documents every model
-  version we use, its checksum, and its canonical source URL.
-- [ ] Have the data-access conversation with Dr. Shu about clinical
-  training data. Everything else is blocked on this.
-- [ ] (Much later) Reach out to Sárándi about potential collaboration
-  once we have something concrete to talk about.
diff --git a/TECHNICAL.md b/TECHNICAL.md
deleted file mode 100644
index b48238b..0000000
--- a/TECHNICAL.md
+++ /dev/null
@@ -1,1191 +0,0 @@
-# NeuroPose Technical Ideation Notes
-
-A living engineering roadmap, parallel to `RESEARCH.md`. Where
-`RESEARCH.md` captures open methodological questions (DTW, skeleton
-choice, hosting the model), this document captures open *engineering*
-questions — release readiness, operability, scaling — and the paths
-they could take.
-
-This is **not** user-facing documentation. Items here are *candidates*
-for future work, and inclusion does not imply commitment.
-
-## How to use this document
-
-- Add a section when you start thinking about a new area of technical
-  investment.
-- Each section should end with a **Scope**, **Sketch**, or **Open
-  questions** block so it's obvious to a future you (or a new
-  contributor) what the concrete next move would be.
-- When an item in here is decided and implemented, move it to the
-  relevant place in `docs/` or in the code itself, and leave a short
-  pointer behind (*See `docs/deployment.md` for the resolved design.*).
-- The audience is anyone maintaining the codebase — Levi, David,
-  Praneeth, Dr. Shu, and whoever comes after us. Assume competence in
-  Python and systems work; don't assume familiarity with our specific
-  tooling choices.
-
-## Three phases, then a contingent track
-
-There are four distinct technical objectives, ordered by timeline and
-by what each enables next.
The sequencing is deliberate: each phase -unblocks the next, and doing them in any other order either publishes -Paper C on top of a pipeline its own design notes disavow, or delays -the open-source release past the window where the accompanying paper -is still salient. - -1. **Phase 0 — C-enabling pipeline work.** A targeted subset of - engineering work that has to land *before* Paper C can start. The - DTW defaults shipped in 0.1 are explicitly a "mechanical port, not - a methodological choice" (see `RESEARCH.md` §1); running the - clinical validation study on them would mean publishing results - from a pipeline the accompanying design notes explicitly criticize. - Phase 0 fixes the analyzer's methodological foundations (Procrustes - preprocessing, cycle segmentation, joint-angle DTW representation), - locks in the reproducibility surface (`Provenance` subobject, - YAML-configurable analysis pipeline), and sets up schema migration - so data generated during Phase 1 survives the long write-up. - **Near-term, well-scoped, weeks of work.** - -2. **Phase 1 — Paper C: clinical validation study.** The planned - clinical-methods paper: cycle-aware joint-angle DTW for clinical - gait similarity, validated against MoCap ground truth and/or - clinician ratings. Gated on MoCap data access via Dr. Shu. This is - research work, not engineering work — this document describes the - engineering scaffolding *around* it, not the paper itself. Phase 2 - work can happen in the background during this phase as ideal filler - for research-burnout cycles. **Months; timeline driven by data - access and experimental design.** - -3. **Phase 2 — Coordinated open-source release + Paper A.** The - engineering-paper companion (A) describing the tech stack, plus - the tagged 0.1 release: PyPI publication, docs deployment, Docker - images, CI matrix, supervision artifacts, doctor preflight, all - the operational items that make the tool credible to external - users. Timed to arrive *with or slightly before* Paper C's - submission, producing a paper-plus-tool bundle that reviewers can - actually run. **Weeks of work, timing driven by Paper C's - submission window.** - -4. **Track 2 — Clinical platform (contingent).** Everything beyond - the open-source research tool — multi-tenancy, audit logging, - HTTP/API layer, clinician UI, clinical-system integrations. Not - sequenced; activates only if specific triggers fire (external - demand, multi-site ambition, funding mandate, publication - traction). Most of this is background thinking, not planned work. - The value of keeping it in this document is so that Phase 0 and - Phase 2 decisions don't accidentally foreclose Track 2 options. - -Phases 0 → 1 → 2 form a near-term sequence that culminates in a -paper-plus-release bundle. Track 2 sits outside that sequence and -does not gate any of it. 
## Contents
-
-- [Phase 0: C-enabling pipeline work](#phase-0-c-enabling-pipeline-work)
-  - [Procrustes preprocessing](#procrustes-preprocessing)
-  - [Gait cycle segmentation](#gait-cycle-segmentation)
-  - [Joint-angle DTW representation](#joint-angle-dtw-representation)
-  - [Provenance subobject](#provenance-subobject)
-  - [YAML-configurable analysis pipeline](#yaml-configurable-analysis-pipeline)
-  - [Schema migration for VideoPredictions](#schema-migration-for-videopredictions)
-- [Phase 1: Clinical validation study (Paper C)](#phase-1-clinical-validation-study-paper-c)
-- [Phase 2: Coordinated open-source release + Paper A](#phase-2-coordinated-open-source-release--paper-a)
-  - [Release definition](#release-definition)
-  - [Apple Silicon CI matrix](#apple-silicon-ci-matrix)
-  - [Mac hardware validation pass](#mac-hardware-validation-pass)
-  - [Retention and pruning](#retention-and-pruning)
-  - [neuropose doctor preflight](#neuropose-doctor-preflight)
-  - [Process supervision artifacts](#process-supervision-artifacts)
-  - [Structured logging option](#structured-logging-option)
-  - [Monitor authentication](#monitor-authentication)
-  - [Docker GPU image](#docker-gpu-image)
-  - [Dependency freshness automation](#dependency-freshness-automation)
-  - [Release workflow](#release-workflow)
-  - [Error-path test coverage expansion](#error-path-test-coverage-expansion)
-- [Track 2: Clinical platform (contingent)](#track-2-clinical-platform-contingent)
-  - [Triggers to activate Track 2](#triggers-to-activate-track-2)
-  - [Multi-tenancy and identity](#multi-tenancy-and-identity)
-  - [Audit logging and compliance posture](#audit-logging-and-compliance-posture)
-  - [HTTP/API layer](#httpapi-layer)
-  - [Clinician-facing UI](#clinician-facing-ui)
-  - [Horizontal scaling](#horizontal-scaling)
-  - [Backup, replication, and data durability](#backup-replication-and-data-durability)
-  - [Clinical-system integrations](#clinical-system-integrations)
-  - [Deterministic inference mode](#deterministic-inference-mode)
-  - [Observability and SLOs](#observability-and-slos)
-  - [Supply-chain attestation and signed releases](#supply-chain-attestation-and-signed-releases)
-  - [Deployment orchestration](#deployment-orchestration)
-- [Decisions to not prematurely foreclose](#decisions-to-not-prematurely-foreclose)
-
----
-
-## Phase 0: C-enabling pipeline work
-
-The six items below are prerequisites for Paper C. Until they land,
-every analysis C produced would run on defaults that `RESEARCH.md` §1
-explicitly flags as provisional. Ship these first, in any order that
-suits the implementer's cadence, and the rest of the project can pick
-up with confidence that Phase 1 results are trustworthy.
-
-### Procrustes preprocessing
-
-**Status:** Not implemented. `neuropose.analyzer.features` ships
-`extract_joint_angles` and feature-statistics helpers; no alignment
-step exists between pose sequences.
-
-**Why it matters for Paper C:** without alignment, DTW distance is
-translation- and orientation-dependent. Two recordings of the same
-subject from different camera positions produce different distances,
-which is almost never what a clinician wants. Paper C's methods
-section would need to apologize for this in print; cheaper to fix the
-method than to defend it.
-
-**Scope:**
-
-- Add `procrustes_align(a: np.ndarray, b: np.ndarray, *, mode:
-  Literal["per_frame", "per_sequence"]) -> tuple[np.ndarray,
-  np.ndarray, AlignmentDiagnostics]` to `neuropose.analyzer.features`.
-  Implements the Kabsch algorithm (closed-form optimal rigid
-  transform). Per-frame aligns each frame of A to the corresponding
-  frame of B independently; per-sequence computes one transform over
-  the whole sequence. Both are useful — per-frame for fine-grained
-  matching, per-sequence for preserving within-trial dynamics.
-- Return aligned arrays plus an `AlignmentDiagnostics` dataclass with
-  the fitted rotation magnitude and translation magnitude so
-  downstream code can flag suspiciously large transforms (usually a
-  sign of upstream annotation error).
-- Expose as an opt-in `align: Literal["none", "procrustes_per_frame",
-  "procrustes_per_sequence"] = "none"` parameter on every DTW entry
-  point in `neuropose.analyzer.dtw`. Default `none` preserves current
-  behavior; Paper C's pipeline sets it to `procrustes_per_sequence`.
-- Unit tests: construct a known rotation + translation between two
-  synthetic skeletons, verify alignment recovers it to within
-  floating-point precision; verify alignment of a sequence with its
-  own translated copy produces zero residual.
-
-**Non-scope:**
-
-- Non-rigid alignment (thin-plate splines, learned registration). Not
-  needed for skeleton-level comparison and would be a research
-  contribution on its own.
-
-**Open question:** should alignment also include optional scaling
-(scaled-Procrustes / full Procrustes)? For cross-subject comparison
-it almost certainly should. Default to scale-preserving and add a
-`scale: bool = False` flag; Paper C can flip it on for cross-subject
-figures.
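-
-For orientation, the core of the per-sequence mode is a few lines of
-NumPy. A minimal sketch, assuming `(frames, joints, 3)` float arrays;
-the function name and the omitted `AlignmentDiagnostics` bookkeeping
-are illustrative, not the final `neuropose` API:
-
-```python
-import numpy as np
-
-
-def kabsch_align_sequence(a: np.ndarray, b: np.ndarray) -> np.ndarray:
-    """Rigidly align sequence ``b`` onto reference ``a``.
-
-    Both inputs are (frames, joints, 3); one rotation + translation is
-    fitted over all frames jointly (the "per_sequence" mode above).
-    Scale-preserving, matching the proposed default.
-    """
-    # Flatten to point clouds and center each on its centroid.
-    pa, pb = a.reshape(-1, 3), b.reshape(-1, 3)
-    ca, cb = pa.mean(axis=0), pb.mean(axis=0)
-    # Kabsch: SVD of the cross-covariance between the centered clouds.
-    u, _, vt = np.linalg.svd((pb - cb).T @ (pa - ca))
-    # Guard against a reflection (det -1); we want a proper rotation.
-    d = np.sign(np.linalg.det(vt.T @ u.T))
-    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
-    return ((pb - cb) @ r.T + ca).reshape(b.shape)
-```
-
-Per-frame mode is the same routine applied to one frame pair at a
-time; for `AlignmentDiagnostics`, the rotation magnitude falls out of
-`r` (via its trace) and the translation out of the fitted offset
-`ca - r @ cb`.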
-
-### Gait cycle segmentation
-
-**Status:** `segment_by_peaks` in `neuropose.analyzer.segment`
-performs generic valley-to-valley segmentation on a supplied 1D
-signal. There is no gait-specific wrapper that knows to look at the
-heel's vertical coordinate.
-
-**Why it matters for Paper C:** clinical gait analysis wants to
-compare *the 4th heel-strike of trial A* to *the 4th heel-strike of
-trial B*, not *frame 120 of A vs frame 120 of B*. Per-cycle DTW is
-the standard approach in the biomechanics literature (Sadeghi et al.
-2000 and descendants); running full-trial DTW on gait is a choice
-reviewers of Paper C would correctly flag as methodologically weak.
-
-**Scope:**
-
-- New `segment_gait_cycles(predictions: VideoPredictions, *, joint:
-  str = "rhee", axis: Literal["x", "y", "z"] = "y", min_cycle_seconds:
-  float = 0.4) -> Segmentation` in `neuropose.analyzer.segment`.
-- Under the hood: extract the specified joint's coordinate along the
-  specified axis, apply `segment_by_peaks` with appropriate distance
-  and prominence thresholds (derived from `min_cycle_seconds` via
-  `predictions.metadata.fps`), return the resulting `Segmentation`
-  (the existing `neuropose.io.Segmentation` type) so downstream
-  tooling picks it up unchanged. (A sketch of the detection core
-  follows below.)
-- Two-sided detection: run the same detection on the opposite heel
-  and return *both* per-side segmentations under named keys
-  (`left_heel_strikes`, `right_heel_strikes`). Clinical users will
-  want both.
-- Allow the reference joint and axis to be configurable so trials
-  recorded with a different camera orientation (lateral vs frontal
-  vs oblique) can still be segmented without a code change.
-
-**Non-scope:**
-
-- HMM-based cycle detection, learned cycle detectors. Peak detection
-  on the vertical coordinate is standard, well-understood, and the
-  method the biomechanics literature expects to see.
-- Handling pathological gaits where heel-strikes are absent
-  (shuffling, walker-assisted). The function should degrade
-  gracefully (return a `Segmentation` with an empty list, not raise),
-  and Paper C's data-quality filtering handles the rest.
-
-**Open question:** should the function also emit a "confidence" per
-cycle (prominence of the detected peak, regularity of spacing) that
-Paper C can use to filter out low-quality detections? Cheap to add,
-useful downstream. Recommend yes.
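-
-The detection core is small. A sketch, assuming SciPy (already a
-dependency via the DTW stack) and a 1D array of one heel's vertical
-coordinate; the function name and the 0.1 prominence factor are
-illustrative starting points, not validated constants:
-
-```python
-import numpy as np
-from scipy.signal import find_peaks
-
-
-def heel_strike_frames(
-    heel_y: np.ndarray, fps: float, min_cycle_seconds: float = 0.4
-) -> np.ndarray:
-    """Candidate heel-strike frame indices for one heel.
-
-    Heel strikes appear as local minima of the heel's height, so peak
-    detection runs on the negated signal (the same valley convention
-    ``segment_by_peaks`` uses).
-    """
-    # Two strikes of the same foot can't be closer than one cycle.
-    min_distance = max(1, int(min_cycle_seconds * fps))
-    # Require peaks to stand out against the signal's overall range so
-    # jitter between steps doesn't register as a strike.
-    prominence = 0.1 * float(heel_y.max() - heel_y.min())
-    strikes, _ = find_peaks(-heel_y, distance=min_distance,
-                            prominence=prominence)
-    return strikes
-```
-
-Consecutive strike indices bound the per-cycle windows; running the
-same function on the other heel produces the second per-side key.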
-
-### Joint-angle DTW representation
-
-**Status:** `dtw_all`, `dtw_per_joint`, and `dtw_relation` operate on
-raw 3D coordinates or joint-pair displacements. `extract_joint_angles`
-produces per-frame angle sequences but is not wired as a DTW input.
-
-**Why it matters for Paper C:** angle-space DTW is translation- and
-rotation-invariant by construction, scale-invariant if normalized,
-and directly interpretable in clinical terms ("knee flexion angle
-during swing phase"). Paper C's headline figures almost certainly
-use angle-space distances; raw coordinates would draw the obvious
-reviewer question of why we aren't comparing the thing clinicians
-actually measure.
-
-**Scope:**
-
-- Add `representation: Literal["coords", "angles", "relation"] =
-  "coords"` to every DTW entry point. The `coords` default preserves
-  existing behavior; `angles` runs `extract_joint_angles` on each
-  input first; `relation` is the existing `dtw_relation` path
-  expressed as a representation choice rather than a separate
-  function (leaving the `dtw_relation` name as a convenience wrapper
-  if preferred).
-- Degenerate-vector handling: `extract_joint_angles` returns NaN for
-  degenerate (zero-length) vectors. The DTW path needs to decide how
-  to handle NaN — skip-and-interpolate, drop, or propagate to the
-  distance. Propagation is safest (makes the problem visible);
-  interpolation is what clinical users probably want day-to-day.
-  Default to propagation and expose `nan_policy: Literal["propagate",
-  "interpolate", "drop"]` for experimentation.
-- Tests: synthetic pair with known angular difference, assert DTW in
-  angle-space recovers it independent of global rotation applied to
-  the input.
-
-**Non-scope:**
-
-- Quaternion or SO(3) rotation-space DTW. Interesting but requires a
-  rotation parameterization the current skeleton output does not
-  produce.
-- Mixed-representation (position + angle concatenated, learned
-  embeddings). These are experiments Paper C might run; they don't
-  belong in Phase 0 infrastructure.
-
-### Provenance subobject
-
-**Status:** `PerformanceMetrics` captures `tensorflow_version`,
-`active_device`, and `tensorflow_metal_active`. Model SHA is not
-computed or propagated. `numpy_version` and `neuropose_version` are
-not recorded. No first-class `Provenance` object.
-
-**Why it matters for Paper C:** reproducibility is the first
-question a reviewer asks of a clinical-methods paper. The answer
-needs to be "same model artifact, same pipeline config, same
-versions, same seeds" — and all four need to be recorded on every
-`results.json` that underlies a paper figure. Not having this means
-either manually tracking it in a lab notebook (fragile, won't
-survive personnel turnover) or running every experiment through a
-pinned Docker image (expensive, doesn't capture runtime
-non-determinism). The subobject is the cheap right answer.
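-
-Concretely, the shape is a flat pydantic model. A sketch (the field
-set mirrors the scope list below, which stays normative; pydantic is
-assumed since the other `neuropose.io` schemas already use it):
-
-```python
-from pydantic import BaseModel
-
-
-class Provenance(BaseModel):
-    """Everything needed to re-run the job behind a results file."""
-
-    model_sha256: str        # hash of the MeTRAbs tarball actually used
-    model_filename: str
-    tensorflow_version: str
-    tensorflow_metal_version: str | None = None
-    numpy_version: str
-    neuropose_version: str
-    python_version: str
-    seed: int | None = None
-    deterministic: bool = False
-    analysis_config: dict | None = None  # verbatim YAML of the run
-```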
- -**Scope:** - -- New `Provenance` pydantic model in `neuropose.io` with fields: - `model_sha256: str`, `model_filename: str`, `tensorflow_version: - str`, `tensorflow_metal_version: str | None`, - `numpy_version: str`, `neuropose_version: str`, `python_version: - str`, `seed: int | None`, `deterministic: bool`, `analysis_config: - dict | None` (the YAML of the run if the pipeline was invoked via - `neuropose analyze --config`). -- Optional `provenance: Provenance | None = None` field on - `VideoPredictions`, `JobResults`, and `BenchmarkResult`. None-valued - on legacy files (enabled by schema migration — see below), populated - on every new write. -- `_model.py` hashes the downloaded tarball on first load (after the - existing SHA verification — the two checks use the same hash so - compute is amortized) and exposes the hash via a - `get_model_sha256()` method on the `Estimator`. `Interfacer._run_job_inner` - constructs the `Provenance` and attaches it to the output. -- Unit test: serialize → JSON → deserialize round-trip identity; - assert `model_sha256` matches the SHA recorded in - `neuropose._model`. - -**Non-scope:** - -- Cryptographic signatures on results.json. That's Phase 2 (sigstore - on release artifacts) or Track 2 (per-output signing) territory, - not Phase 0. -- Provenance on arbitrary intermediate products (numpy arrays, DTW - distance matrices). Top-level JSONs cover Paper C's needs; richer - intermediates can inherit from a hand-off if needed. - -**Open question:** does Paper C need *per-frame* provenance (which -frame was processed with which configuration) or just per-job -provenance? Per-job is enough for reproducibility; per-frame is only -useful if we want to mix configurations within a single job, which -has no current use case. - -### YAML-configurable analysis pipeline - -**Status:** `neuropose.cli`'s `analyze` subcommand is a stub that -raises `NotImplementedError`. Analysis operations are called -individually from Python, or via CLI flags on `segment` and -`benchmark`. No unified representation of "a complete analysis run." - -**Why it matters for Paper C:** the paper will run many experimental -configurations — alignment on/off, per-frame vs per-sequence, raw -coordinates vs joint angles, full-trial vs cycle-segmented DTW, -various distance metrics. Each experiment should be reproducible -from a single file that can be version-controlled, diffed, attached -to the `Provenance` object, and cited in the paper. A Python script -full of kwargs is the alternative, and it's exactly the alternative -the open-source community collectively decided against ten years ago. - -This item also resolves the "`neuropose analyze`: ship or remove" -question that was previously open: we are shipping `analyze`, just -specifically in a YAML-driven form. The stub that currently exists -becomes the real command in Phase 0. - -**Scope:** - -- `AnalysisConfig` pydantic model in `neuropose.analyzer` capturing - the full pipeline: input source (predictions file path), - preprocessing (`align`, `normalize`, `segment`), per-segment - analysis (DTW backend, representation, distance function, extra - kwargs), output (figures, statistics, distance matrices). -- Parseable from YAML via pydantic; validated on parse so typos in - field names fail early with a clear error. -- `neuropose analyze --config experiment.yaml [--output - results_042.json]` runs the pipeline end-to-end. 
  The config YAML
  is serialized into the resulting `Provenance.analysis_config`, so
  the output file is self-describing.
- Ship three or four *example* configs under `examples/analysis/`
  that exercise the full matrix of alignment × representation ×
  segmentation choices Paper C will care about. Double as integration
  tests.

**Non-scope:**

- A DAG / workflow engine (Snakemake, Nextflow). A flat config is
  enough for Paper C's needs; reach for a DAG tool only when
  experiments have genuine inter-stage dependencies, which analysis
  of a single video does not.
- Parallel sweep execution. Run multiple configs via a shell loop
  for now (`for cfg in examples/analysis/*.yaml; do neuropose
  analyze --config "$cfg" --output "out/$(basename "$cfg" .yaml).json"; done`).
  A real sweep orchestrator is Track 2.

**Open question:** should there be a `neuropose analyze compare
<config_a> <config_b>` subcommand that runs both and emits a diff
figure? Useful for Paper C but not a gating feature —
post-Phase-0 addition if the need is clear.

### Schema migration for VideoPredictions

**Status:** `VideoPredictions` gained `segmentations: dict[str,
Segmentation] = Field(default_factory=dict)` during recent work. Old
JSON files without the field still load (pydantic default-factories
back-fill), but this is accidental rather than designed-in.

**Why it matters for Paper C:** Paper C will produce analysis results
over the course of 6–12 months. During that window, Phase 0 work
itself will evolve — the `Provenance` object will gain fields, the
`AnalysisConfig` shape will stabilize, maybe the `Segmentation` schema
will extend. Without migration support, every schema change would
invalidate some portion of Paper C's already-generated data, forcing
either a freeze (drops velocity) or a full re-run (wastes compute).
Migration now is the cheap fix.

**Scope:**

- Add a `schema_version: int = 1` field to `VideoPredictions`,
  `JobResults`, and `BenchmarkResult` (the three load-anywhere
  top-level schemas).
- Write `migrate_video_predictions(payload: dict) -> dict` that
  takes a raw JSON-loaded dict, dispatches on `schema_version`, and
  returns a dict conformant with the current version. Default to 1
  when missing (existing files).
- Wire it into `load_video_predictions()` so the migration runs
  before pydantic validation. Log at INFO on migration so users see
  when files are being upgraded.
- When writing, always write the current version.

**Non-scope:**

- A general-purpose migration framework. A function that dispatches
  on an integer is sufficient until we have three versions.
- In-place migration (writing back the upgraded file). Migrations
  should run on read; write-back is a separate operator decision.

---

## Phase 1: Clinical validation study (Paper C)

Phase 1 is *Paper C itself* — the clinical-methods paper this project
exists to produce. The content belongs in the paper, in `RESEARCH.md`,
and in the analysis-config YAMLs under `examples/`, not here. This
section exists only to demarcate the phase and to capture the
engineering commitments that should (and should not) happen during it.

**Engineering posture during Phase 1:**

- **Phase 0 is frozen on entry.** Don't refactor the analyzer during
  Phase 1; refactors invalidate earlier experiments. If a Phase 0
  shortcoming surfaces during paper-writing, log it in `RESEARCH.md`
  and revisit after submission.

- **Phase 2 work is welcome as background.** Writing a launchd plist,
  wiring up Dependabot, tightening error-path tests — all of this is
  ideal filler work during the experimental-design and writing
  phases of Paper C. It consumes different energy than research work
  does, and the tool is in better shape on submission day as a
  result.
- **`RESEARCH.md` gets the bulk of the updates.** Methods decisions,
  reading-list expansions, and reviewer-response notes all live
  there, not here.
- **Do add engineering-side notes here** when a Paper C experiment
  reveals a piece of missing tooling that's worth a Phase 2 item
  (for example: "we needed batch-analysis across 200 trials and hit
  this, so Phase 2 should include ..."). Phase 1 is the best
  possible source of prioritization signal for what Phase 2 is
  actually worth.

**Prerequisite outside this document:** a MoCap-data-access
conversation with Dr. Shu. Nothing in Phase 1 can start until that
conversation has happened. `RESEARCH.md` §3 flags this as the
gating question for fine-tuning; it is equally the gating question
for validation.

---

## Phase 2: Coordinated open-source release + Paper A

Phase 2 is the release. Its content is exactly the items listed here
— the engineering work to take the Phase-0-plus-Phase-1 codebase to a
state where an outside researcher can pick it up, install it, run it,
verify its claims, and cite it. It runs concurrently with the tail
end of Phase 1 (see posture notes above) and culminates in a
coordinated drop: tag → PyPI → Pages → arXiv / JOSS submission for
Paper A → reference in Paper C's Code Availability section.

### Release definition

Before enumerating the remaining work, define what "released" means.
A release candidate should satisfy all of the following:

1. **Installable on a blank machine.** `pip install neuropose` or
   `uv pip install neuropose` works on both Linux x86_64 and Apple
   Silicon Mac, with no manual steps beyond Python 3.11.
2. **Runnable without the author in the room.** The `docs/` site is
   published somewhere persistent (GitHub Pages, Cloudflare Pages),
   the getting-started walkthrough actually works end-to-end, and
   the MeTRAbs model downloads and verifies on first run.
3. **Verifiable by a reviewer.** CI runs on every push, covers both
   Linux and macOS, and a PR from a stranger could be meaningfully
   reviewed without access to the research Mac.
4. **Honest about its limits.** Every surface the release advertises
   is either exercised in CI or clearly marked experimental. No
   false promises in the README or CLI help text. (The `analyze`
   stub that originally motivated this item became a real command
   via Phase 0's YAML pipeline, so "ship or remove" is no longer
   open.)
5. **Versioned.** A git tag exists, `__version__` matches, and
   `CHANGELOG.md` has a real release section, not just `[Unreleased]`.
6. **Bundled.** Paper A (tech-stack writeup) and Paper C (clinical
   validation) cite the release tag, and the release notes cite
   them. The three artifacts arrive together; reviewers of either
   paper can find and run the code.

Items below are the gaps between the end-of-Phase-0 state and that
definition.

### Apple Silicon CI matrix

**Status:** `RESEARCH.md` lists this as an open next step; no
`macos-14` entry in `.github/workflows/ci.yml`.
- -**Why it matters for release:** every claim of "Apple Silicon -support" is currently "by construction" — the TF 2.16+ floor ships -`darwin/arm64` wheels, the MeTRAbs SavedModel has zero custom ops, and -therefore it should work. It has not been empirically confirmed on -real hardware in an automated way. For a public release, we either -verify in CI or we stop claiming Mac support in the README. - -**Scope:** - -- Add a `macos-14` matrix entry to the `test` job (lint and typecheck - stay single-platform, they're platform-independent). -- Exclude `slow` markers on macOS so we don't pay the 2 GB model - download twice per run. -- Accept that the first green macOS run may require two or three - hotfixes — path case sensitivity, `multiprocessing` spawn vs fork, - shared library load order — and budget a day for that. -- Do **not** add a Metal runner. GitHub's `macos-14` runners don't - expose the GPU to TensorFlow in a useful way, and the `[metal]` - extra's numerical verification is a separate task that needs real - M-series silicon we control. - -**Sketch:** - -```yaml -test: - strategy: - fail-fast: false - matrix: - os: [ubuntu-latest, macos-14] - runs-on: ${{ matrix.os }} -``` - -Everything else in the job stays the same; `uv` works identically on -both platforms. - -### Mac hardware validation pass - -**Status:** Unexercised. The Shu Lab research Mac (`100.64.15.110`) is -available; we have an rsync script but no cron job, no automated -smoke check, no numerical-divergence report against the Linux -baseline. - -**Why it matters for release:** CI on GitHub's `macos-14` runners -validates that the wheels install and the tests pass on Apple -Silicon. It does not validate that the real MeTRAbs model loads, that -inference runs, or that `poses3d` on the Mac matches `poses3d` on -Linux within a sane tolerance. Those are different questions, and -answering them against a throwaway runner each time would be wasteful -and unreliable. - -A minimum version of this check — "does `detect_poses` produce -output on the research Mac at all?" — should happen during Phase 0 -regardless, because Paper C will likely run on the same hardware and -a silent numerical divergence there would invalidate the paper's -results. The scope below is the full, release-grade version. - -**Scope:** - -- Run `neuropose benchmark --compare-cpu` against a reference clip on - the research Mac. Capture the resulting `BenchmarkResult` JSON. -- Commit the JSON as `benchmarks/reference/mac_m3_ultra_cpu_v0_1.json` - (a tracked file, not gitignored — this is the reference numerics - we'll compare against going forward). -- Separately, run the `[metal]` path and diff. Record in - `RESEARCH.md` whether divergence is within the ~1e-2 mm budget the - research notes propose, or whether the Metal path is in the "use at - your own risk" column. -- Document the findings as a new section in `RESEARCH.md` ("Apple - Silicon verification, 2026-0X") and close the corresponding - open-question entry. - -**Open question:** should the reference JSON become a test input -(slow-marked integration test that re-runs benchmark on a developer's -machine and asserts divergence from the committed reference), or just -documentation? The former catches regressions automatically at the -cost of a 2 GB model download in the slow job; the latter is cheaper -but easier to ignore. - -### Retention and pruning - -**Status:** `out/` and `failed/` grow forever. No retention config. -No `neuropose prune` command. 
- -**Why it matters for release:** a research Mac running the daemon -unattended for months will fill its disk. The first support request -will be "the daemon just stopped working" and the answer will be "you -ran out of disk." We can solve this once now, or a hundred times -later. - -**Scope:** - -- Add a `retention_days: int | None = None` setting (default None = - disabled, preserving current behavior). -- When set, the daemon checks on each poll whether any job in - `out/` or `failed/` is older than the threshold and removes it. The - corresponding `status.json` entry transitions to a new `PRUNED` - state (keeping the audit trail) or is removed entirely (keeping the - status file small) — pick one and document. -- Ship a `neuropose prune [--older-than N] [--dry-run]` one-shot - command for operators who want manual control. -- Document in `docs/deployment.md` with a recommended default (30 - days feels right for benchmark/iteration workflows; clinical - deployments would be legal-driven and much longer). - -**Open question:** should pruned jobs' `status.json` entries be -preserved as tombstones (so a user asking "where did job X go?" can -see "pruned 2026-05-01") or removed entirely? Tombstones are more -user-friendly; removal keeps the status file bounded. Default to -tombstones since the status file bound is only a problem at a scale -the 0.1 release won't hit. - -### neuropose doctor preflight - -**Status:** Not implemented. - -**Why it matters for release:** pydantic-settings validates the -*schema* of `Settings` (is `device` a valid string, is -`poll_interval_seconds` positive). It does not validate the -*environment* — is `data_dir` writable, is the lock file acquirable, -is `model_cache_dir` on the same filesystem as `data_dir` (so -`os.rename` works atomically), is the configured TF device actually -available. Each of those is a runtime failure mode that shows up with -an ugly traceback ten seconds after `neuropose watch` starts, and -every one is cheaply detectable at startup. - -**Scope:** - -- New subcommand `neuropose doctor` that runs a battery of - preflight checks and prints a pass/fail table. -- Checks to include: `data_dir` exists and is writable; lock file - acquirable (with clean release); all three subdirectories - (`in/out/failed`) writable; `model_cache_dir` writable and on the - same filesystem as `data_dir`; TF is importable; configured - `device` is in `tf.config.list_physical_devices()`; - `tensorflow-metal` either absent or installed with a version that - advertises support for the installed TF; XDG envvars are sane; - Python version matches `pyproject.toml` floor. -- Exit code 0 if all checks pass, 1 if any warning, 2 if any fatal - failure. -- The daemon's `run()` entry point calls the same underlying - preflight function before entering the poll loop, so - `watch`-without-doctor still gets the benefit. - -**Non-scope:** - -- Do not check for network access to the MeTRAbs download host. - Network-dependent checks make CI flaky and don't match the offline - caching behavior of real operators. - -### Process supervision artifacts - -**Status:** `docs/deployment.md` documents a systemd user unit as -text in prose. No file in `scripts/` that a user can actually copy. -No macOS launchd plist at all. - -**Why it matters for release:** copy-paste from a docs page into a -`.service` file works, but it's friction. An open-source project with -"here is the file, here is where it goes, here is the enable command" -ships deployments faster. 

**Scope:**

- Ship `scripts/systemd/neuropose.service` as a file with `%h`
  placeholders and a short install README.
- Ship `scripts/launchd/org.levineuwirth.neuropose.plist` as a file
  with an install README. (Consider making the plist label match the
  reverse-DNS of whoever is hosting — either the lab's domain or
  `org.neuropose.daemon` for a vendor-neutral identity.)
- Optional: a `scripts/install_service.sh` that detects the platform
  and runs the right install command. Probably not worth the
  complexity; a five-line README section per platform is fine.

**Non-scope:**

- Do not write installers for init systems we do not personally run
  (upstart, sysvinit, runit). If someone needs those, the systemd
  unit gives them enough of a template.

### Structured logging option

**Status:** Everything logs to stderr via `logging.basicConfig`
with a human-readable formatter.

**Why it matters for release:** the current format is correct for
interactive use. For any consumer that wants to feed the daemon's
output into Loki, Splunk, Grafana, Datadog, or even `jq`-based
aggregation, JSON-per-line would eliminate a parsing step. This is
a near-free feature if added now and a disruptive formatting change
if added later. It is also a prerequisite for any Track 2
audit-logging work, so building it now keeps Track 2 options open at
near-zero cost.

**Scope:**

- Add a `--log-format={human,json}` global CLI option defaulting to
  `human`.
- Implement the `json` variant as a formatter that emits
  `{"ts": ..., "level": ..., "logger": ..., "message": ..., ...}` per
  line with no log-line wrapping.
- Wire it through `_configure_logging()` so every subcommand benefits
  identically.

**Open question:** do we also want log correlation IDs per job?
That's a bigger change (pushing a context var through the
Interfacer's call stack) and probably Track 2 — skip for 0.1.

### Monitor authentication

**Status:** The monitor binds to `127.0.0.1:8765` by default. No
auth, no tokens. `--host 0.0.0.0` works but has a comment warning the
operator to think.

**Why it matters for release:** loopback-only is a reasonable
default, but the monitor is specifically marketed as the thing
collaborators can watch. "Collaborator" implies a browser somewhere
other than the daemon host. The "correct" answer (TLS, real auth) is
too expensive for 0.1; the "wrong but acceptable" answer (no auth, so
anyone who can reach the port sees everything) is what we have now.
There's a middle ground.

**Scope:**

- Add an optional `monitor_token: str | None = None` setting.
- When set, every request to `/` and `/status.json` must carry
  `?token=<token>` in the query string or `X-Status-Token` in the
  header. No token → 401.
- `neuropose serve` prints a URL including the token on startup, so
  operators can copy-paste it. If `monitor_token` is unset, behavior
  is unchanged.
- `--host 0.0.0.0` emits a stderr warning if `monitor_token` is unset
  — don't block it, just flag it.

**Non-scope:**

- TLS. Use a reverse proxy (Caddy, nginx, `ssh -L`) for any
  internet-facing exposure. The monitor is not the right place to
  terminate TLS.
- Multi-user auth, session cookies, anything with a database. That's
  Track 2.

### Docker GPU image

**Status:** `Dockerfile` exists (CPU-only). `Dockerfile.gpu`
mentioned in CHANGELOG as planned.

**Why it matters for release:** a single-file CUDA deployment story
reduces "can I run this on our lab server?"
from a 45-minute dance
with conda and CUDA versions to one `docker run`. For Linux GPU
users this is the friction difference between trying the project and
bouncing.

**Scope:**

- Write `Dockerfile.gpu` on top of `nvidia/cuda:12.x-runtime-ubuntu22.04`
  (pick the version TF 2.18 actually supports — check the
  `tensorflow-gpu` compat matrix, not just "latest").
- Multi-stage: build stage has `uv` and builds the venv; final stage
  just copies the venv and sets entrypoints.
- Add a `docker-build.yml` CI workflow that builds both images on
  every push to main and publishes as `ghcr.io/neuwirth/neuropose:cpu`
  and `:gpu` (or wherever the project ends up hosted).
- Document in `docs/deployment.md` with a `docker run --gpus all`
  example.

**Non-scope:**

- A `tensorflow-metal` Docker image. Mac can't virtualize Metal, so
  there's no point.

### Dependency freshness automation

**Status:** No Dependabot, no Renovate. Everything floats until
somebody notices. The recent TF cap tightening (`<2.19`) was caught
manually because a user happened to ask; a scheduled bot would have
flagged it weeks earlier.

**Why it matters for release:** security CVEs on transitive
dependencies land every few weeks. Without automation, they get
discovered by a downstream user trying to install into an audited
environment. With automation, they become a PR you either merge or
explicitly decline.

**Scope:**

- Add `.github/dependabot.yml` with groups: `python-prod`,
  `python-dev`, `github-actions`. Weekly schedule. Ignore `tensorflow`
  updates until manually cleared (the `tensorflow-metal` constraint
  means auto-bumping TF is destructive).
- Alternative: Renovate via `renovate.json`. Renovate has better
  grouping and scheduling; Dependabot is simpler and needs no setup
  on GitHub. For an open-source Brown-lab project, Dependabot is
  enough.
- Add `uv lock --upgrade-package <name>` to the dev playbook in
  `docs/development.md` so PR authors know how to re-lock.

### Release workflow

**Status:** `[project.scripts]` is wired for `pip install`, but no
tag-triggered publishing pipeline. `.github/workflows/docs.yml`
uploads the built docs as a 14-day artifact, not to Pages.

**Why it matters for release:** "release" without a repeatable
publishing flow is a synonym for "someone runs `hatch build` by hand
on their laptop at 11pm before the paper deadline." That is not a
release.

**Scope:**

- `.github/workflows/release.yml` triggered on version tags
  (`v[0-9]+.[0-9]+.[0-9]+`). Steps: check version matches
  `__version__`; build with `hatch build`; publish to PyPI via
  trusted publisher (no long-lived token); create GitHub release with
  changelog excerpt.
- Flip `docs.yml` to deploy the `site/` output to GitHub Pages on
  every push to `main` once the repo is public. Pin the Pages URL in
  the README and in `site_url` in `mkdocs.yml` (already points at
  `levineuwirth.github.io`, but verify).
- Sign tags with GPG; document the key fingerprint in `SECURITY.md`
  (which does not yet exist; create it).
- Consider wiring sigstore signing at the same time — see Track 2
  supply-chain section. Free after the initial setup and buys
  everything Track 2 would want without committing to the rest of
  that track.

**Open question:** do we publish under `neuropose`, `brown-neuropose`,
or something else on PyPI? Whichever name, squat it before the paper
drops — waiting means risking namesquatter abuse.
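
One step in the workflow above, the tag-vs-`__version__` check, is
small enough to sketch here (script path is hypothetical;
`GITHUB_REF_NAME` is the tag name GitHub Actions exposes on
tag-triggered runs):

```python
# scripts/check_release_version.py (hypothetical path): fail the
# release job early if the pushed tag and __version__ disagree.
import os
import sys

import neuropose

tag = os.environ["GITHUB_REF_NAME"]     # e.g. "v0.1.0" on a tag build
expected = f"v{neuropose.__version__}"

if tag != expected:
    # sys.exit with a string prints it to stderr and exits nonzero.
    sys.exit(f"tag {tag!r} does not match neuropose.__version__ ({expected!r})")
print(f"version check OK: {tag}")
```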

### Error-path test coverage expansion

**Status:** Happy paths and a handful of input-validation errors
covered. Not covered: disk full mid-write, corrupt video mid-decode,
OOM during inference, fcntl.flock on NFS (no-op on some kernels),
truncated zip archives, permission denied on data_dir subdirectories.

**Why it matters for release:** shipping a tool where "happy path
works" is different from shipping a tool where "when it fails, it
fails predictably." For a clinical research pipeline where a crash
mid-job quarantines valuable recording data, fault tolerance is a
feature.

**Scope:**

- Systematic pass: for each module, write a
  `test_<module>_failure_modes.py` enumerating the specific exception
  classes that can escape and the corresponding test case that
  triggers each one. Use `pytest.raises` with the exact expected
  exception class.
- Hardest cases use fixtures that monkeypatch system calls
  (`os.write` raises OSError(ENOSPC), `cv2.VideoCapture.read` returns
  `False, None` partway through, `fcntl.flock` raises OSError(EBADF)).
- Aim: every user-facing error message in the codebase has a test
  that proves it's reachable.

**Non-scope:**

- Chaos-engineering frameworks. `monkeypatch` is enough.
- Covering unrecoverable errors like SIGKILL of the daemon mid-frame.
  That's the recovery-on-startup test, which already exists.

---

## Track 2: Clinical platform (contingent)

Track 2 is everything beyond the open-source research tool —
multi-tenancy, audit logging, HTTP/API layer, clinician UI,
clinical-system integrations, the works. None of it is sequenced
with Phases 0–2; all of it is gated on specific triggers that don't
exist yet.

### Triggers to activate Track 2

Do not start Track 2 work until at least one of the following is
true:

1. **External demand.** Another clinical group has asked for a
   deployment they can run independently. Not a casual "interesting
   project" — a specific ask with a specific cohort and a specific
   timeline.
2. **Multi-site ambition.** The Shu Lab decides to run NeuroPose
   across more than one site within Brown-affiliated clinical
   systems, and the single-host assumption stops working.
3. **Funding mandate.** A grant or contract specifies outputs that
   the Phase 0-1-2 deliverables cannot meet (e.g. "produce a
   HIPAA-compliant deployment," "integrate with the EHR").
4. **Publication traction.** Papers A and C get engagement that
   translates into demand for a hosted version. Clinical-methods
   papers occasionally do. If enough unsolicited inquiries land,
   Track 2 becomes worth the investment.

Before at least one of these triggers: everything below is
background thinking, not planned work. *Do not refactor Phase 0 or
Phase 2 code to make Track 2 easier.* Every such refactor is a bet
on a future that may not arrive.

### Multi-tenancy and identity

**What it would require:**

- A concept of "user" distinct from "OS user." Today `Settings.data_dir`
  is one directory per OS user; multi-tenancy means one `data_dir`
  serving many logical tenants with enforced isolation.
- Per-tenant namespacing in `in/`, `out/`, `failed/`, and
  `status.json`. Cleanest is one subdirectory per tenant with the
  same four-directory layout; the daemon's discovery logic becomes a
  two-level scan.
- Authentication on the control plane.
  Passing tenant identity
  as a command-line arg is fine for a research prototype; a real
  deployment needs OAuth/OIDC or SAML with the institution's IdP
  (Brown CAS, Epic Auth, whatever the target site uses).
- Authorization model: at minimum, "tenant A cannot see tenant B's
  jobs." For clinical deployments, probably also role-based (clinician
  / PI / admin / auditor).

**Cheapest path forward if a trigger fires:** fork the data-directory
layout into `$data_dir/<tenant>/{in,out,failed,status.json}`,
teach the daemon to iterate tenants in its poll loop, and add a
`--tenant` flag to the CLI. That's enough for an invitation-only
deployment where tenants are identified by an opaque string and
issued out-of-band.

**Expensive path:** anything involving an identity provider. Don't
go there without a real operator committing to the deployment.

### Audit logging and compliance posture

**What it would require:**

- Append-only log of every data access, write, and configuration
  change, with actor identity and timestamp. Separate from the
  application log (which rotates).
- Logs streamed to a write-once sink (S3 with object-lock,
  immutable journal) so a compromised host can't rewrite the
  trail.
- Legal review: what exactly does HIPAA require of this tool? What
  about institutional IRB? The answer will differ across sites and
  the project cannot prescribe it — but the *capability* to generate
  the required logs needs to be built in.
- Retention policy wired to the audit log, not just application
  state. Pruning job results is different from pruning audit records.

**Technical prerequisite:** structured logging from Phase 2 (which
is a cheap add and is scheduled anyway). Without JSON-per-line logs,
audit extraction is a grep-and-pray regex problem.

### HTTP/API layer

**What it would require:**

- Today the control plane is "write files to `in/`." For a
  non-filesystem-native consumer (a hosted web UI, a batch scheduler,
  a Jupyter kernel in a different container), an HTTP API is the
  right abstraction.
- FastAPI or Litestar on top of the existing ingest/interfacer/io
  modules. The daemon becomes a long-running process that serves
  requests *and* processes the input directory; or the daemon stays
  headless and the HTTP layer is a separate process talking via the
  same filesystem contract.
- OpenAPI schema published as part of the release so client code can
  be generated.

**Non-obvious pitfall:** the daemon's fcntl-based single-instance
lock assumes one writer. If the HTTP layer is a separate process, it
needs to go through the same ingest API, not directly into `in/`.
That's an easy discipline to establish if designed in from day one,
a painful refactor later.

**Cheap Phase 0/2 precaution:** keep `neuropose.ingest` and
`neuropose.interfacer` API-stable as Python modules. If a future
HTTP layer imports them, we don't want to break the import.

### Clinician-facing UI

**What it would require:**

- More than the `neuropose serve` dashboard — an actual web
  application with clinician-facing views: patient list, session
  list, session-level pose visualization, comparison against
  reference motion, exportable reports.
- Probably React + TypeScript on the frontend, consuming the HTTP
  API above. Backend-rendered templates would be faster to build but
  a worse fit for the per-session interaction model clinicians
  expect.
- WebGL or Three.js for 3D pose playback.
  The `neuropose.visualize`
  module is a matplotlib-based still-frame tool; rebuilding it for
  interactive 3D is a weeks-to-months project on its own.
- Accessibility: clinician environments include keyboard-only users,
  users on institutional IE holdovers (yes, still), users with
  screen readers. A research-grade UI ignores this; a clinical-grade
  one cannot.

**Scope is enormous.** This is the single largest piece of Track 2
and would likely dwarf all other Track 2 work combined. Would not
start without dedicated frontend engineering effort.

### Horizontal scaling

**What it would require:**

- A message broker (Redis Streams, RabbitMQ, or NATS) in place of the
  filesystem poll. Each job becomes a broker message; multiple
  worker processes consume and process in parallel.
- Shared storage for inputs and outputs (S3, MinIO, NFS). The
  "job_name is a directory" contract generalizes to "job_name is an
  object prefix."
- Per-worker GPU affinity for the multi-GPU case; worker auto-sizing
  based on queue depth.
- Distributed lock for the leader-only work (status file writes,
  retention enforcement).

**Upgrade path that minimizes pain:** the current single-process
daemon is equivalent to the "one worker" case of a horizontal
deployment. If the job object in `neuropose.io` stays the source of
truth (not the filesystem layout), the transition is backend-swap,
not architectural surgery. Keep that option open by treating the
filesystem as an implementation detail of `Interfacer`, not a public
contract.

### Backup, replication, and data durability

**What it would require:**

- Outputs (`out/<job_name>/results.json`) currently live on one disk
  on one host. For clinical data this is insufficient durability.
- Replication target: another host (hot standby), object storage
  (warm archive), or both. The `out/` directory is the canonical
  store; replicating it periodically is a scriptable cron job today.
- Proper replication: as writes happen, not as a cron. Either a
  daemon-side hook that PUTs to S3 immediately after each
  `save_job_results`, or a sidecar process watching the filesystem
  with `inotify`/`fswatch`.
- Restore story: how do we restore `out/` from backup without
  breaking `status.json` (which refers to job names by convention)?
  Test this annually.

**Minimum viable backup for Phase 2:** add a `scripts/backup.sh`
that rsyncs `$data_dir/out/` to a configurable destination. Not a
feature; a paving-the-path-for-operators artifact.

### Clinical-system integrations

**What it would require:**

- **DICOM** if videos are stored as DICOM instances rather than
  MP4. Clinical motion-analysis devices increasingly output DICOM
  video; reading DICOM means `pydicom` + some decoding logic.
- **FHIR** for patient metadata. If NeuroPose is to accept a
  patient ID and attach it to a session, that ID probably comes
  from a FHIR Patient resource. Means speaking FHIR to the hospital's
  FHIR endpoint (Epic, Cerner).
- **REDCap** integration for clinical-research cohorts (the Brown
  ecosystem uses it heavily). An export script that pulls subject
  metadata from a REDCap project and lays it into the ingest
  directory is cheap and valuable.

**Order of likely need:** REDCap first (easy, valuable, Brown-local),
then DICOM (depends on what the recording device outputs), then
FHIR (only if we're pulling from an EHR, which we probably aren't
for research).
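
To size the REDCap item: the export script could be as small as the
sketch below. The endpoint, env-var plumbing, and the per-record
`metadata.json` layout are illustrative assumptions; the
`token`/`content`/`format` fields are standard REDCap API surface.

```python
# Sketch: pull subject metadata from a REDCap project and lay it into
# the ingest directory as one metadata.json per record. Layout and
# env vars are assumptions, not settled design.
import json
import os
from pathlib import Path

import requests

REDCAP_URL = os.environ["REDCAP_URL"]      # e.g. https://redcap.example.edu/api/
REDCAP_TOKEN = os.environ["REDCAP_TOKEN"]  # project-scoped API token
INGEST_DIR = Path(os.environ.get("NEUROPOSE_IN", "in"))

# Standard REDCap REST call: export all records of the project as JSON.
records = requests.post(
    REDCAP_URL,
    data={"token": REDCAP_TOKEN, "content": "record", "format": "json"},
    timeout=30,
).json()

for record in records:
    # record_id is REDCap's default record-identifier field.
    job_dir = INGEST_DIR / str(record["record_id"])
    job_dir.mkdir(parents=True, exist_ok=True)
    (job_dir / "metadata.json").write_text(json.dumps(record, indent=2))
```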

### Deterministic inference mode

**What it would require:**

- Phase 0's `Provenance` object already captures model SHA, TF
  version, NumPy version, and a seed field. The missing piece for
  strict reproducibility is forcing TensorFlow itself to behave
  deterministically —
  `tf.config.experimental.enable_op_determinism()` plus seeding all
  of `random`, `numpy.random`, and `tf.random`.
- A `deterministic: bool = False` setting on `Settings` that flips
  the above. Default off, because deterministic mode costs a
  meaningful fraction of throughput on GPUs and isn't free on CPUs
  either. Clinical deployments would turn it on; benchmark runs
  would turn it off.
- A `Provenance.deterministic` boolean field is already in the Phase
  0 scope; this item closes the loop by actually making that
  boolean mean something.

**Cheap Phase 2 precaution:** wire the setting in Phase 2 even if we
don't flip it on by default. Future Track 2 deployments can flip it
without a code change.

### Observability and SLOs

**What it would require:**

- Prometheus metrics endpoint (separate port from the monitor, no
  auth needed on metrics, loopback or behind a scraper only).
- Counters: jobs_processed, jobs_failed, frames_processed, bytes_read,
  bytes_written. Histograms: per-frame latency, per-job latency,
  per-video latency. Gauges: queue depth, active job count.
- Tracing: OpenTelemetry instrumentation on job_process,
  detect_poses, save_job_results. Again, the interesting spans are
  the long ones, so trace-sampling at 100% is usually fine until
  throughput matters.
- Defined SLOs: "99% of jobs complete within 10× video duration,"
  "95% of monitor requests return in under 100 ms," etc.
  SLO definitions go into a `docs/slos.md`; burn-rate alerting is
  the operational half.

**Overriding dependency:** none of this is useful without Track 2
demand. A single-user research Mac doesn't have SLOs.

### Supply-chain attestation and signed releases

**What it would require:**

- SBOM generation on every release (CycloneDX or SPDX format,
  attached to the GitHub release and published alongside the wheel).
- Signed releases: sigstore / cosign signatures on the wheel, the
  Docker images, and the source tarball. GitHub's OIDC +
  sigstore makes this a ten-line workflow once set up. For a clinical
  tool, a reviewer being able to verify "this wheel is the one GitHub
  Actions produced from this commit" is non-negotiable.
- Reproducible builds: same source → same wheel hash. Python wheels
  are usually reproducible with `SOURCE_DATE_EPOCH` set and `.pyc`
  exclusion; document the exact command.
- Provenance attestations (SLSA level 2 or 3) for the CI pipeline.
  GitHub's `actions/attest-build-provenance` action does this.

**Cheapest Phase 2 precaution:** wire sigstore signing into the
release workflow when it's first built (see Phase 2 release workflow
section). Free after the initial setup.

### Deployment orchestration

**What it would require:**

- Kubernetes manifests (Helm chart, probably). Pod specs for the
  daemon, the monitor, the HTTP API. Separate deployments so they
  can scale independently.
- Terraform or Pulumi for the underlying infrastructure: GPU
  node pool, object storage, IAM, TLS termination. Site-dependent;
  Brown runs primarily on-prem with some AWS — the IaC would need
  to target both.
- Secrets management: Vault, AWS Secrets Manager, or K8s
  Secrets + External Secrets Operator.
  The monitor token, the broker
  credentials, and the object-storage keys all need to stop being
  env vars in a `.service` file.

**Strong recommendation:** do not write any of this until there is
a specific deployment with specific operators. Generic K8s manifests
written without a target are a solution in search of a problem, and
they age fast.

---

## Decisions to not prematurely foreclose

A short list of choices we should avoid making in Phase 0 or Phase 2
that would make Track 2 more expensive later:

1. **Keep `neuropose.ingest` and `neuropose.interfacer` API-stable
   as Python modules.** A future HTTP layer should be able to import
   them. Avoid adding `@staticmethod` decorators that hide internal
   state; avoid coupling to global config.
2. **Keep the filesystem layout reversible.** Anything in
   `$data_dir` that is not a user artifact should be treated as
   internal. If Track 2 wants to replace the filesystem with an
   object store, the daemon's only file I/O should be via
   `neuropose.io` helpers — no raw opens scattered through the code.
3. **Keep `VideoPredictions.provenance` extensible.** The Phase 0
   `Provenance` model should be a pydantic model so fields can be
   added backward-compatibly. Don't pack provenance into free-form
   strings or nested dicts that require bespoke parsing.
4. **Keep the CLI subcommands orthogonal.** Do not add subcommands
   that wrap multiple other subcommands for convenience; that
   creates API shape we'd regret if the right composition layer
   later is HTTP, not shell.
5. **Keep model loading behind `neuropose._model`.** A future
   self-hosted model registry, signed-artifact verification, or
   multi-model switching should be a change in one file, not a
   refactor across the estimator.
6. **Keep `Settings` the single source of truth.** No `os.environ`
   reads outside pydantic-settings; no sprinkled `Path.home()`
   calls. Track 2 almost certainly overrides configuration from
   a secret store, and if that override has one place to hook in,
   it's easy.
7. **Keep status-file schema owned by pydantic, not hand-written
   JSON.** Track 2 multi-tenancy means indexing into the status
   file by tenant; a pydantic model refactor is cheap, a
   hand-written dict refactor is not.
8. **Keep the `AnalysisConfig` shape additive.** The Phase 0 YAML
   schema will evolve through Phase 1 as Paper C's experiments
   surface needs. Additions are free (new optional fields);
   renames and removals invalidate prior experiments. Pydantic's
   `extra="forbid"` catches typos at parse time while still
   allowing additive extension (see the sketch at the end of this
   document).

These are cheap-now / expensive-later items. Every other Track 2
decision can wait for a real trigger.
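
A minimal illustration of item 8's `extra="forbid"` posture (field
names are hypothetical; the real `AnalysisConfig` shape is Phase 0
work):

```python
# Illustration only: an additive schema that still rejects typos.
import pydantic
from pydantic import BaseModel, ConfigDict


class AnalysisConfig(BaseModel):
    model_config = ConfigDict(extra="forbid")  # unknown keys fail at parse time

    predictions_path: str
    align: bool = True               # later versions add optional fields
    representation: str = "coords"   # like these; never rename or remove


try:
    AnalysisConfig.model_validate({"predictions_path": "run1.json", "algin": True})
except pydantic.ValidationError as err:
    print(err)  # the typo'd "algin" surfaces at parse time, not mid-experiment
```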