untrack lab-internal ideation docs

RESEARCH.md and TECHNICAL.md are living R&D / engineering roadmap
notes — pre-meeting drafts, speculative directions, and in-progress
thinking that should evolve freely without public-repo concerns.
Same for docs/research/, a new directory for pre-meeting scoping
artifacts (e.g. the MoCap data-needs spec being drafted for the
upcoming conversation with Dr. Shu).

Files stay on disk in every checkout — the gitignore just stops
them from entering the index. Anything that graduates to a
user-facing artifact moves into docs/ (which is tracked and feeds
mkdocs) rather than these files.
Levi Neuwirth, 2026-04-23 09:12:48 -04:00
parent c2e9989f22, commit a4186582fa
3 changed files with 10 additions and 1975 deletions

.gitignore (+10 lines)

@@ -70,6 +70,16 @@ Thumbs.db
# --- Docs site build -------------------------------------------------------
site/
# --- Ideation / lab-notebook docs ------------------------------------------
# Living R&D notes and engineering roadmaps. Kept locally so they can
# evolve freely with in-progress thinking, pre-meeting drafts, and
# speculative directions that don't belong in the public repo. Anything
# under docs/research/ is treated the same way — personal / lab-internal
# working artifacts, not published docs.
/RESEARCH.md
/TECHNICAL.md
/docs/research/
# --- Data and model weights (policy-enforced) ------------------------------
# Runtime job directories, subject data, and downloaded model caches must
# never be committed. The default runtime location is under $XDG_DATA_HOME

RESEARCH.md

@@ -1,784 +0,0 @@
# NeuroPose Research and Ideation Notes
A living R&D log for open design questions, speculative directions, and
planned experiments that are larger in scope than individual commits.
This is **not** user-facing documentation — items in here are
*candidates* for future work, and inclusion does not imply commitment.
## How to use this document
- Add a section when you start thinking about a new area of investigation.
- Each section should end with an **Open questions** or **Next steps**
block so it's obvious to a future you (or a new contributor) what the
active threads are.
- When something in here is decided and implemented, move it to the
relevant place in `docs/` or in the code itself and leave a short
pointer behind ("*See `docs/architecture.md` for the resolved design.*").
- Consider the audience: yourself, Dr. Shu, David, Praneeth, and future
contributors. Assume they know pose estimation at a grad-student level
but may not have followed every prior conversation.
## Contents
- [DTW methodology](#dtw-methodology)
- [TensorFlow version compatibility](#tensorflow-version-compatibility)
- [MeTRAbs hosting and extensibility](#metrabs-hosting-and-extensibility)
---
## DTW methodology
### Current implementation (v0.1, commit 10)
`neuropose.analyzer.dtw` ships three entry points, all built on top of
[`fastdtw`](https://github.com/slaypni/fastdtw) with
`scipy.spatial.distance.euclidean` as the point-distance function:
- **`dtw_all(a, b)`** — single DTW on flattened `(frames, joints × 3)`
vectors. One scalar distance for the whole sequence.
- **`dtw_per_joint(a, b)`** — one DTW call per joint, returning a list
of per-joint distances and warping paths. Preserves per-joint
temporal alignment at J× the cost.
- **`dtw_relation(a, b, joint_i, joint_j)`** — DTW on the per-frame
displacement vector between two specific joints. The intent here is
to capture "how does the relationship between these two joints change
over time", which is translation-invariant and so immune to raw
camera-frame changes.
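For orientation, a minimal sketch of what these wrappers look like on top of `fastdtw` (illustrative bodies, not the shipped module):
```python
# Minimal sketch of the three entry points, assuming the public fastdtw API
# fastdtw(x, y, dist=...) -> (distance, path). Bodies are illustrative; the
# shipped module adds validation, types, and tests.
import numpy as np
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean

def dtw_all(a: np.ndarray, b: np.ndarray) -> float:
    # One DTW on flattened (frames, joints * 3) vectors.
    dist, _ = fastdtw(a.reshape(len(a), -1), b.reshape(len(b), -1), dist=euclidean)
    return dist

def dtw_per_joint(a: np.ndarray, b: np.ndarray) -> list[tuple[float, list]]:
    # One DTW per joint: J x the cost, per-joint warping paths preserved.
    return [fastdtw(a[:, j], b[:, j], dist=euclidean) for j in range(a.shape[1])]

def dtw_relation(a: np.ndarray, b: np.ndarray, joint_i: int, joint_j: int) -> float:
    # DTW on the per-frame displacement between two joints
    # (translation-invariant by construction).
    dist, _ = fastdtw(a[:, joint_i] - a[:, joint_j],
                      b[:, joint_i] - b[:, joint_j], dist=euclidean)
    return dist
```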
These three correspond directly to the three helpers that existed
(broken) in the previous prototype's `analyzer.py`, ported forward with
bug fixes, types, and tests. **The port was mechanical — not a
methodological choice.** We inherited the FastDTW + Euclidean defaults
without validating them against the clinical research use cases, and
that validation is overdue.
### Known limitations of the v0.1 approach
#### FastDTW is an approximation, not exact DTW
[FastDTW](https://cs.fit.edu/~pkc/papers/tdm04.pdf) is a multi-scale
approximation that runs in linear time by recursively refining a coarse
alignment. For the radius-based implementation in
`slaypni/fastdtw`, the distance is not guaranteed to match exact DTW,
and in pathological cases the error can be significant. For a research
codebase where the DTW distance is going to show up in a figure, that
matters.
**Candidate exact alternatives** (all pip-installable):
- [`dtaidistance`](https://github.com/wannesm/dtaidistance) — C-based,
supports both exact DTW and a `fast=True` approximation; also
supports shape-DTW and various constraint bands. Actively maintained,
and the underlying algorithms match the textbook.
- [`tslearn`](https://tslearn.readthedocs.io/) — ML-flavored toolkit
with exact DTW, soft-DTW (differentiable), Sakoe-Chiba banding, and
kernel-DTW. Good fit if we ever want to feed DTW distances into an
sklearn/PyTorch pipeline.
- [`dtw-python`](https://github.com/statefb/dtw-python), a Python
  port of the R `dtw` package; exhaustive options for windowing,
step patterns, and open-ended alignment. Less friendly API but the
most rigorously documented.
#### Euclidean is a choice, not a default
Treating `(x, y, z)` joint positions as a point in R³ and taking
Euclidean distances implicitly assumes the three axes are commensurable
in the same units, which is fine for MeTRAbs (mm) but throws away prior
knowledge about human motion. Alternatives worth considering:
- **Angular distance on joint angles.** Compute joint angles per frame
(`extract_joint_angles` already exists) and run DTW on the angle
sequences rather than raw coordinates. Translation- and
scale-invariant by construction; well-matched to clinical metrics
like knee flexion angle.
- **Geodesic distance on SO(3)** for local joint rotations. Requires a
skeleton-rooted rotation parameterization; more work to set up but
the right metric for "how different are these two poses?" in a
biomechanics sense.
- **Mahalanobis distance** against a learned pose prior. This is the
"machine learning" answer — fit a covariance to a reference corpus
(normal gait from a healthy cohort), then measure distances in the
whitened space. Requires enough data to fit the prior without
overfitting, but makes "is this gait abnormal?" a calibrated question.
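The Mahalanobis variant reduces to Euclidean DTW in a whitened feature space, so it composes with any DTW backend. A sketch, assuming a reference corpus of per-frame pose vectors (hypothetical helper, not in `neuropose` today):
```python
# Whitening sketch for the Mahalanobis idea: fit a covariance on a reference
# corpus, then run ordinary Euclidean DTW on the whitened sequences.
import numpy as np

def fit_whitener(reference: np.ndarray):
    """reference: (samples, features). Returns a function x -> whitened x."""
    mu = reference.mean(axis=0)
    cov = np.cov(reference - mu, rowvar=False)
    chol = np.linalg.cholesky(np.linalg.inv(cov + 1e-6 * np.eye(len(cov))))
    # Euclidean distance between whitened vectors equals Mahalanobis
    # distance between the originals.
    return lambda x: (x - mu) @ chol
```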
#### Preprocessing: what invariance do we want?
The v0.1 implementation is not invariant to anything. Two videos of the
same subject with a different camera position will give a different
DTW distance, which is almost certainly not what a clinician wants.
Candidate preprocessing steps:
- **Translation invariance**: subtract the root joint (pelvis or torso
centroid) from every joint per frame, so all poses are expressed in a
body-relative coordinate frame. Cheap and almost always desired.
- **Scale invariance**: divide by a reference length (e.g., torso
length, or total skeleton span) so tall and short subjects produce
comparable distances. Important for comparing across subjects.
- **Rotation invariance**: align to a canonical frame (e.g., hip-to-hip
vector = x-axis, hip-to-shoulder = z-axis) per frame. Required if the
subject's orientation relative to the camera varies between trials.
- **Procrustes alignment per frame**: fit the best rigid transform
(rotation + translation) between pose A's frame and pose B's frame
before computing distance. The closed-form
[Kabsch algorithm](https://en.wikipedia.org/wiki/Kabsch_algorithm) is
fast and exact. This is likely the *right* thing for most comparison
use cases but has never been wired up.
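The per-frame Kabsch step is small enough to sketch here (assuming `(joints, 3)` arrays in a shared joint order; this helper is not wired up anywhere yet):
```python
# Per-frame Kabsch alignment sketch: rigidly align pose p onto pose q
# before computing a distance. Not part of neuropose yet.
import numpy as np

def kabsch_align(p: np.ndarray, q: np.ndarray) -> np.ndarray:
    """p, q: (joints, 3). Returns p rotated and translated onto q."""
    p_c, q_c = p - p.mean(axis=0), q - q.mean(axis=0)  # remove translation
    u, _, vt = np.linalg.svd(p_c.T @ q_c)              # SVD of the covariance
    d = np.sign(np.linalg.det(u @ vt))                 # guard against reflections
    rot = u @ np.diag([1.0, 1.0, d]) @ vt              # optimal rotation
    return p_c @ rot + q.mean(axis=0)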
The `dtw_relation` helper is translation- and (for unit-vector
displacements) scale-invariant by construction, which is why it ends up
being the most useful of the three existing entry points in practice.
#### Representation: coordinates, angles, velocities, or dual?
The v0.1 DTW operates on **3D joint coordinates** (translation-dependent)
or **joint-pair displacements** (`dtw_relation`). Other representations
worth comparing:
- **Joint angles.** Using `extract_joint_angles` output as the DTW
input gives a rotation-and-translation-invariant comparison that's
also directly interpretable in clinical terms.
- **Joint velocities.** Temporal derivatives of position. Emphasizes
*how the pose changes* rather than *what it is* — good for
discriminating smooth from jerky motion in gait (see the sketch after
this list).
- **Dual (position + angle).** Concatenate normalized position and
angle features into a single per-frame vector. More expressive but
requires tuning the relative weights.
- **Learned embeddings.** Feed each frame through a pretrained
pose-representation network (there are a few) and DTW on the
embedding space. Expensive and opaque but may capture
higher-order structure.
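The velocity representation is worth pinning down explicitly, since the sampling rate changes the units. A hypothetical helper, assuming `(frames, joints, 3)` input (angles already have `extract_joint_angles`; velocities have nothing yet):
```python
# Velocity representation sketch: finite differences over frames.
# fps converts per-frame deltas to mm/s when positions are in mm.
import numpy as np

def to_velocities(seq: np.ndarray, fps: float = 30.0) -> np.ndarray:
    """seq: (frames, joints, 3) -> (frames - 1, joints, 3) velocities."""
    return np.diff(seq, axis=0) * fps
```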
#### Multi-scale approaches
FastDTW is already multi-scale internally. Other ideas:
- **Coarse-to-fine DTW.** Downsample aggressively, run exact DTW on
the coarse version to get a sub-quadratic alignment, then refine
locally. This is essentially what FastDTW does, but with an explicit
signal-processing hat on.
- **Wavelet-decomposed DTW.** Decompose each joint's trajectory into
wavelet coefficients and run DTW on the low-frequency coefficients.
Unclear whether this actually helps; interesting because it separates
posture (low-frequency) from tremor / micro-motion (high-frequency).
#### Clinical gait: cycle-aware DTW
Gait is approximately periodic, and "the 4th heel-strike of trial A"
is the clinically meaningful comparison point to "the 4th heel-strike
of trial B", not "frame 120 of A vs frame 120 of B". A natural two-stage
approach:
1. **Cycle detection.** Find heel-strikes (or other gait events) via
peak detection on a joint's vertical coordinate, and segment each
trial into individual cycles.
2. **Per-cycle DTW.** Time-warp within each cycle independently to
normalize cycle duration. The distance between trials is then the
sum / mean of per-cycle distances.
This is standard in the biomechanics literature
([Sadeghi et al. 2000](https://doi.org/10.1016/S0966-6362(00)00074-3)
and descendants) and is almost certainly a better fit for clinical
comparison than the naive full-trial DTW we ship today.
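A sketch of the two-stage shape, assuming `scipy.signal.find_peaks` for event detection and a `dtw_all`-style scalar helper as sketched earlier (heel-joint index, axis convention, and peak parameters are placeholders to tune per dataset):
```python
# Cycle-aware DTW sketch: segment at heel-strikes, warp within cycles.
import numpy as np
from scipy.signal import find_peaks

def segment_cycles(seq: np.ndarray, heel: int, min_frames: int = 20) -> list[np.ndarray]:
    """Split (frames, joints, 3) into cycles at heel-strike events,
    detected here as minima of the heel's vertical (z) coordinate."""
    strikes, _ = find_peaks(-seq[:, heel, 2], distance=min_frames)
    return [seq[s:e] for s, e in zip(strikes, strikes[1:])]

def cycle_dtw(a: np.ndarray, b: np.ndarray, heel: int, dtw=dtw_all) -> float:
    """Mean of per-cycle DTW distances, pairing the i-th cycles of A and B."""
    pairs = zip(segment_cycles(a, heel), segment_cycles(b, heel))
    return float(np.mean([dtw(ca, cb) for ca, cb in pairs]))
```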
#### Soft-DTW for learning applications
[Soft-DTW](https://arxiv.org/abs/1703.01541) is a differentiable
relaxation of DTW, which means gradients can flow through it. This
matters if we ever want to train a network to *learn* a distance
metric or an embedding under a DTW objective — for example, a pose
encoder whose output space is calibrated to gait similarity. Worth
keeping on the radar even if we're not training anything today.
`tslearn` implements it.
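Usage is a two-liner if we ever need it, per tslearn's documented API:
```python
# Soft-DTW via tslearn. gamma controls the softness of the min;
# gamma -> 0 recovers ordinary (hard) DTW.
from tslearn.metrics import soft_dtw
d = soft_dtw(seq_a, seq_b, gamma=1.0)  # seq_a, seq_b: (frames, features)
```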
### Evaluation strategy
Validating a DTW implementation is harder than validating most things.
Some ideas for how to know we got it right:
- **Synthetic perturbations.** Take a reference sequence and apply
known perturbations (time stretch, added noise, spatial offset) and
verify that distance scales monotonically with perturbation magnitude
and that invariance properties are honored (see the sketch after this
list).
- **Reference implementation parity.** For a small set of hand-picked
pairs, compute DTW distance using `dtaidistance` exact DTW and
our implementation, and verify the approximation error is below a
documented threshold.
- **Inter-rater clinical benchmark.** When we have labeled clinical
data, measure how well DTW distance correlates with clinician
ratings of gait similarity. This is the real test but is gated on
having data we can use.
- **Pathology discrimination.** Can DTW distance separate healthy
from impaired gait in a held-out set? This is the usefulness test.
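The synthetic-perturbation check is cheap enough to sketch now; a hypothetical test, using `dtw_all` as sketched earlier and a toy resampler:
```python
# Monotonicity check for the synthetic-perturbation idea: distance should
# grow with time-stretch magnitude. Entirely illustrative.
import numpy as np

def time_stretch(seq: np.ndarray, factor: float) -> np.ndarray:
    """Linearly resample a (frames, ...) sequence to factor x its length."""
    idx = np.round(np.linspace(0, len(seq) - 1, int(len(seq) * factor)))
    return seq[idx.astype(int)]

def test_distance_monotone_in_stretch(reference: np.ndarray) -> None:
    dists = [dtw_all(reference, time_stretch(reference, f))
             for f in (1.0, 1.25, 1.5, 2.0)]
    assert dists == sorted(dists)  # grows with perturbation magnitude
```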
### Open questions
1. Is FastDTW good enough, or should we move to `dtaidistance` exact
DTW as the default? (First concrete experiment: pick 20 pairs from
whatever reference data we can source, compute distance both ways,
see if the approximation error is acceptable.)
2. What's the right representation for clinical gait DTW — raw
coordinates, joint angles, or per-pair displacements?
3. Should we implement Procrustes alignment as a preprocessing step
before any DTW call? (If yes, it belongs in `neuropose.analyzer.features`.)
4. Should the clinical pipeline use cycle-segmented DTW instead of
full-trial DTW? This is a methodological choice with real
downstream implications.
5. Is soft-DTW useful to us, or is it a solution looking for a
problem we don't have?
6. What reference corpus do we use to develop and validate any of this?
### Reading list
- Sakoe, H. & Chiba, S. (1978). "Dynamic programming algorithm
optimization for spoken word recognition." The original DTW paper.
- Salvador, S. & Chan, P. (2007). "Toward accurate dynamic time
warping in linear time and space."
[PDF](https://cs.fit.edu/~pkc/papers/tdm04.pdf). The FastDTW paper.
- Cuturi, M. & Blondel, M. (2017). "Soft-DTW: a Differentiable Loss
Function for Time-Series." [arXiv 1703.01541](https://arxiv.org/abs/1703.01541).
- Sadeghi, H. et al. (2000). "Symmetry and limb dominance in able-bodied
gait: a review." Biomechanics reference for cycle-aware analysis.
- `dtaidistance` documentation —
<https://dtaidistance.readthedocs.io/>. Worth reading even if we
don't switch, for the overview of DTW variants and constraints.
### Next steps
- [ ] Pick 10–20 reference pose-sequence pairs and run both FastDTW and
exact DTW on them to quantify the approximation error.
- [ ] Prototype a Procrustes-aligned preprocessing wrapper and
re-run the same pairs.
- [ ] Sketch a cycle-aware DTW pipeline against a gait dataset we can
actually use (identity- and IRB-safe).
- [ ] Decide whether to keep FastDTW as the default or replace it.
- [ ] If we replace it: migrate `neuropose.analyzer.dtw` to the new
backend in a single commit with no API change.
---
## TensorFlow version compatibility
### The question
The pinned MeTRAbs model artifact
(`metrabs_eff2l_y4_384px_800k_28ds.tar.gz`) is a TensorFlow SavedModel.
SavedModels embed a producer TF version and depend on a set of TF op
kernels. Picking a TF version pin that is too low risks Apple Silicon
install pain (pre-2.16 has no native `darwin/arm64` wheel under the
`tensorflow` package name); picking one that is too high risks loading
or runtime failures if MeTRAbs uses ops that have been renamed,
deprecated, or removed. The goal of this investigation was to find the
**minimum** pin that works on Linux x86_64, Linux arm64, and macOS arm64
without forcing platform-conditional dependencies or shipping
`tensorflow-metal` as a default.
### Method
Phase 0 of the procedure laid out earlier in this document was to
inspect the SavedModel directly and run `detect_poses` end-to-end on a
synthetic input. The probe script (`test.py` at the repo root, kept
during the investigation and removed in the same commit that landed the
pin) did three things:
1. Parsed `saved_model.pb` with `saved_model_pb2.SavedModel` and read
the `tensorflow_version` and `tensorflow_git_version` fields out of
each `meta_info_def` to establish the **producer** version.
2. Walked every `node.op` and `library.function[*].node_def[*].op` in
the graph to enumerate the **complete set of ops** the model relies
on. This is the binary-compatibility surface — anything in this set
that gets removed in a future TF release breaks the model.
3. Called `tf.saved_model.load(MODEL_DIR)`, accessed
`per_skeleton_joint_names["berkeley_mhad_43"]`, and invoked
`model.detect_poses(image, intrinsic_matrix=..., skeleton="berkeley_mhad_43")`
on a 288×384 black frame to confirm the consumer TF version actually
*runs* the model (not just loads it — these are different failure
modes).
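A compressed reconstruction of the probe (the original `test.py` was removed with the commit; this re-sketch uses only public TF APIs, and `MODEL_DIR` is a placeholder for the extracted SavedModel directory):
```python
# Re-sketch of the three probe steps. The original probe also passed
# intrinsic_matrix to detect_poses; omitted here for brevity.
from pathlib import Path
import tensorflow as tf
from tensorflow.core.protobuf import saved_model_pb2

MODEL_DIR = Path("metrabs_eff2l_y4_384px_800k_28ds")  # placeholder

# Step 1: producer version fields.
sm = saved_model_pb2.SavedModel()
sm.ParseFromString((MODEL_DIR / "saved_model.pb").read_bytes())
for mg in sm.meta_graphs:
    print(mg.meta_info_def.tensorflow_version,
          mg.meta_info_def.tensorflow_git_version)

# Step 2: complete op inventory (graph nodes plus library functions).
ops: set[str] = set()
for mg in sm.meta_graphs:
    ops.update(n.op for n in mg.graph_def.node)
    for fn in mg.graph_def.library.function:
        ops.update(n.op for n in fn.node_def)
print(sorted(ops))

# Step 3: consumer-version load and end-to-end inference on a black frame.
model = tf.saved_model.load(str(MODEL_DIR))
image = tf.zeros((288, 384, 3), dtype=tf.uint8)
pred = model.detect_poses(image, skeleton="berkeley_mhad_43")
print({k: (tuple(v.shape), v.dtype.name) for k, v in pred.items()})
```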
The probe ran on Linux x86_64 against whatever `uv sync --group dev`
resolved at the time, which was **TensorFlow 2.21.0** with **Keras
3.14.0** — i.e. the most recent TF release as of 2026-04 and a version
that crosses the Keras-3 cutover at TF 2.16.
### Result
- **Producer version:** `tf version: 2.10.0`,
`producer: v2.10.0-0-g359c3cdfc5f`. The model was serialized in
September 2022, consistent with the file mtimes in the extracted
tarball.
- **Custom ops:** **zero**. `tf.raw_ops.__dict__` filtered for
`"metrabs"` returned `[]`. Every op in the SavedModel is a stock
TensorFlow kernel that has been stable since at least TF 2.4.
- **Op inventory** (recorded for posterity so a future contributor can
diff against a newer MeTRAbs release without re-running the probe):
```
Abs, Add, AddV2, All, Any, Assert, AssignVariableOp, AvgPool,
BatchMatMulV2, BiasAdd, Bitcast, BroadcastArgs, BroadcastTo, Cast,
Ceil, Cholesky, CombinedNonMaxSuppression, ConcatV2, Const, Conv2D,
Cos, Cross, Cumsum, DepthwiseConv2dNative, Einsum, EnsureShape, Equal,
Exp, ExpandDims, Fill, Floor, FloorDiv, FloorMod, FusedBatchNormV3,
GatherV2, Greater, GreaterEqual, Identity, IdentityN, If,
ImageProjectiveTransformV3, LeakyRelu, Less, LessEqual, Log,
LogicalAnd, LogicalNot, LogicalOr, LookupTableExportV2,
LookupTableFindV2, LookupTableImportV2, MatMul, MatrixDiagV3,
MatrixInverse, MatrixSolveLs, MatrixTriangularSolve, Max, MaxPool,
Maximum, Mean, MergeV2Checkpoints, Min, Minimum, Mul,
MutableDenseHashTableV2, Neg, NoOp, NonMaxSuppressionWithOverlaps,
NotEqual, Pack, Pad, PadV2, PartitionedCall, Placeholder, Pow, Prod,
RaggedRange, RaggedTensorFromVariant, RaggedTensorToTensor,
RaggedTensorToVariant, Range, Rank, ReadVariableOp, RealDiv, Relu,
Reshape, ResizeArea, ResizeBilinear, RestoreV2, ReverseV2,
RngReadAndSkip, SaveV2, Select, SelectV2, Shape, ShardedFilename,
Sigmoid, Sin, Size, Slice, Softplus, Split, SplitV, Sqrt, Square,
Squeeze, StatefulPartitionedCall, StatelessIf,
StatelessRandomUniformV2, StatelessWhile, StaticRegexFullMatch,
StridedSlice, StringJoin, Sub, Sum, Tan, Tanh, TensorListConcatV2,
TensorListFromTensor, TensorListGetItem, TensorListReserve,
TensorListSetItem, TensorListStack, Tile, TopKV2, Transpose, Unpack,
VarHandleOp, Where, While, ZerosLike
```
- **Load:** `tf.saved_model.load` returned a `_UserObject` with
`detect_poses` exposed. No warnings about deprecated kernels, no
errors. The 11-minor-version forward jump from producer 2.10 to
consumer 2.21 was a non-event, including the Keras 3 cutover at 2.16.
- **Skeleton check:** `per_skeleton_joint_names["berkeley_mhad_43"]` had
shape `(43,)` and `per_skeleton_joint_edges["berkeley_mhad_43"]` had
shape `(42, 2)`, exactly matching what
`tests/integration/test_estimator_smoke.py` asserts.
- **End-to-end inference:** `model.detect_poses` on a black 288×384
frame returned `{'poses3d': (0, 43, 3), 'boxes': (0, 5),
'poses2d': (0, 43, 2)}`, all `float32`. Zero detections is the
correct output for a black frame — the important signal is that the
shapes, dtypes, and key names exactly match what `FramePrediction` in
`neuropose.io` is built to ingest, so the entire estimator pipeline
is wire-compatible with this TF version.
### Decision
Pin `tensorflow>=2.16,<2.19`. Reasoning:
1. **2.16 is the Apple Silicon floor that matters.** TF 2.16 is the
first release with native `darwin/arm64` wheels published on PyPI
under the `tensorflow` package name. Below 2.16, Mac users would
need `tensorflow-macos` (a separate Apple-maintained package), which
forces ugly platform markers in `pyproject.toml` and means Linux and
Mac users run subtly different codebases. Above 2.16, the same
single dependency line installs cleanly on every supported platform.
2. **MeTRAbs imposes no upper bound below 3.0.** Producer 2.10 → consumer
2.21 (an 11-minor-version jump across the Keras 3 boundary) loaded
and ran without a single complaint. The op inventory is 100% stock,
so future TF 2.x releases would only break this if they removed
stable kernels — which would itself be a TF 2.x SemVer violation.
3. **`tensorflow-metal` is an opt-in extra, not a default.**
`tensorflow-metal` is a PluggableDevice that Apple ships separately
to add a Metal-backed `/GPU:0`. It has its own version-compatibility
table (Apple maintains it at
`developer.apple.com/metal/tensorflow-plugin/`), has a documented
history of producing silently-wrong numerics on specific TF ops,
and breaks intermittently on Keras 3. For a clinical-research
pipeline where reproducibility matters more than inference latency,
CPU inference on Mac is the right default. We do ship a
`[project.optional-dependencies].metal` extra that pulls
`tensorflow-metal>=1.2,<2` under darwin/arm64 platform markers, so
users who want the speedup can opt in via
`pip install 'neuropose[metal]'` — but the Metal path is not
exercised in CI, is documented as experimental in
`docs/getting-started.md`, and users are expected to spot-check
`poses3d` output against the CPU path before trusting it for any
clinical measurement.
4. **`tensorflow-metal` forces a TF upper bound.** `tensorflow-metal`
1.2.0 (released January 2025, the latest version as of 2026-04) is
advertised as supporting "TF 2.18+" but in practice fails on
2.19 and 2.20 with symbol-not-found errors and graph-execution
`InvalidArgumentError`s. See
[tensorflow/tensorflow#84167](https://github.com/tensorflow/tensorflow/issues/84167)
and the Apple Developer forum threads at
[developer.apple.com/forums/thread/772147](https://developer.apple.com/forums/thread/772147)
and [developer.apple.com/forums/thread/803658](https://developer.apple.com/forums/thread/803658).
2.18.x is the last version confirmed to work cleanly on Apple
Silicon GPU. Even though the Metal path is opt-in, dependency
resolution is shared — if uv resolves `tensorflow` to 2.21 on a
Linux developer's machine and 2.18 on the Mac, lockfile churn
and "works on my box" become permanent. Cap is therefore applied
globally rather than via a darwin/arm64 marker split. Cost on
Linux is zero: nothing in the pipeline depends on TF 2.19+
features, and the SavedModel ran fine on TF 2.21 in the probe
above, so the cap is purely an external-package constraint. Lift
it once Apple ships a Metal plugin that tracks mainline
TensorFlow again.
### What is **not** yet verified
- The probe ran on Linux x86_64 only. macOS arm64 has not been exercised
on real hardware. The argument that it should work is by construction:
`tensorflow>=2.16` ships native arm64 macOS wheels, the SavedModel
uses zero custom ops, and there is no MeTRAbs-side platform code — but
empirical confirmation is still pending.
- Linux arm64 has likewise not been exercised. Same by-construction
argument applies.
- A `macos-14` GitHub Actions matrix entry (which would run the unit
tests on Apple Silicon hardware) is the cheapest way to catch any
regression and is the intended follow-up.
- Inference-output numerics have not been compared across platforms.
This is the next layer of rigor beyond "does it run" — we expect
fp32 results to match within ~1e-3 mm on `poses3d`, but a real
cross-platform diff against a reference set has not been done.
- The `[metal]` optional-dependencies extra exists in `pyproject.toml`
but the Metal code path has never been exercised against the
pinned MeTRAbs SavedModel. Enabling it is a pure opt-in and comes
with a documented "verify your own numerics" caveat in
`docs/getting-started.md`. Whether it actually produces a speedup
on EfficientNetV2-L-based inference on real clinical videos —
and whether that speedup is worth the numerical-divergence risk
— is unknown.
### Open questions
1. Does the same `detect_poses` call produce numerically equivalent
`poses3d` on macOS arm64 as on Linux x86_64 against a real (non-black)
reference image? Within what tolerance?
2. If a future MeTRAbs release introduces a custom op (e.g. for a new
detector head), how do we want the loader to fail? Currently the
`_REQUIRED_MODEL_ATTRS` interface check would still pass; the failure
would surface at first `detect_poses` call, which is late.
3. ~~Does it make sense to upper-bound the pin more tightly than `<3.0`
(e.g. `<2.22` to bound to tested versions), or is the SemVer guard
sufficient given the all-stock-ops result?~~ **Resolved 2026-04-16.**
Tightened to `<2.19` for `tensorflow-metal` compatibility. See
reasoning point 4 in the Decision section above.
### Next steps
- [ ] Run the same probe on real macOS arm64 hardware and log the
result (load success, detect_poses success, output numerics
diff against the Linux baseline).
- [ ] Add a `macos-14` matrix entry to `.github/workflows/ci.yml` for
the unit tests. Slow tests stay Linux-only to avoid doubling the
MeTRAbs download cost in CI.
- [ ] Re-run the probe whenever MeTRAbs upstream publishes a new model
tarball, and diff the op inventory above. Any new op that is not
in the list above is a flag worth investigating before raising
the pin.
- [ ] Benchmark `[metal]` vs CPU on a real Apple Silicon Mac against
a short reference clip: measure (a) per-frame latency, (b) peak
memory, and (c) `poses3d` divergence from the CPU baseline. If
the speedup is meaningful and the numerics are within
~1e-2 mm, move the `metal` extra from "experimental" to
"supported" in the docs. If not, document the failure mode
here and keep the extra where it is.
---
## MeTRAbs hosting and extensibility
### Current state (v0.1, commit 11)
The model loader in `neuropose._model.load_metrabs_model` will pin the
canonical upstream URL:
```
https://omnomnom.vision.rwth-aachen.de/data/metrabs/metrabs_eff2l_y4_384px_800k_28ds.tar.gz
```
This is the RWTH Aachen "omnomnom" host — a raw HTTP file server run
by the MeTRAbs authors' lab. There is no current HuggingFace mirror
of the relevant MeTRAbs variant at the time of commit 11.
The URL encodes the model configuration:
`metrabs_eff2l_y4_384px_800k_28ds` means the EfficientNetV2-L backbone,
YOLOv4 detector head, 384-pixel input, 800k training steps, trained on
28 datasets. This name pattern is worth preserving when we host the
model ourselves so future variants stay self-describing.
### Supply-chain concerns
Pinning a single upstream URL to a third-party academic host is a
real supply-chain risk, and the audit of the previous prototype called
it out explicitly (the old code used `bit.ly/metrabs_1`, which was
even worse). Concrete failure modes:
- The RWTH Aachen host goes down or is decommissioned.
- The URL changes when Sárándi releases a new MeTRAbs version.
- The tarball contents change under the same URL without a version bump.
**Minimum mitigation** (should land in or immediately after commit 11):
- **Pin a SHA-256 checksum** alongside the URL, and verify on download
before unpacking. If the checksum doesn't match, fail hard with a
clear error.
- **Cache aggressively.** Once downloaded and verified, never hit the
network again for the same configuration. `model_cache_dir` is
already in `Settings`.
- **Document the exact filename and checksum** in `RESEARCH.md` (or
migrate to a `MODEL_ARTIFACTS.md` file) so operators have a way to
manually download the model out-of-band if the primary URL is dead.
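A fallback-ordering sketch once a mirror exists (the self-hosted URL is hypothetical until the hosting decision below lands; the SHA-256 pin applies to every source, and `_download` is the assumed helper from `neuropose._model`):
```python
# Mirror-fallback sketch: try sources in order, verify the same pinned
# checksum regardless of which one answered.
from pathlib import Path

_MODEL_URLS = [
    "https://models.levineuwirth.org/metrabs/metrabs_eff2l_y4_384px_800k_28ds.tar.gz",  # planned mirror
    "https://omnomnom.vision.rwth-aachen.de/data/metrabs/metrabs_eff2l_y4_384px_800k_28ds.tar.gz",  # origin
]

def _download_with_fallback(dest: Path) -> None:
    for url in _MODEL_URLS:
        try:
            _download(url, dest)
            return
        except OSError:
            continue  # try the next source; the checksum check happens after
    raise RuntimeError("all model sources failed")
```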
### Self-hosting options
We want to host the model ourselves, both for reliability and because
it opens the door to future fine-tuning and redistribution of our own
variants. Candidate hosting approaches:
#### Forgejo LFS
Pros:
- Lives next to the code.
- Version-controlled artifacts.
- Access control mirrors repo access.
Cons:
- LFS is designed for git-tracked binary assets, not for large
infrequently-updated model weights — you pay LFS overhead on every
clone unless you configure `lfs.fetchexclude`.
- Model is ~2.2 GB; Forgejo LFS performance at that size is untested
for our instance.
- Pinning is by LFS pointer, which means the model is coupled to a
particular repo revision. Messy if we want multiple code revisions
to share the same model.
**Verdict:** Workable but not the best fit.
#### Forgejo generic package registry
Forgejo supports a [generic package
registry](https://forgejo.org/docs/latest/user/packages/generic/) that
can host arbitrary binary artifacts with versioned URLs. This is
closer to what we want:
```
https://git.levineuwirth.org/api/packages/neuwirth/generic/metrabs/eff2l_y4_384px_800k_28ds/metrabs.tar.gz
```
Pros:
- Versioned URLs decoupled from repo revisions.
- Upload once, download many times, no clone coupling.
- Integrated auth if we want to gate access.
- Can be made public even if the repo is private.
Cons:
- Requires uploading the file manually or via an API call.
- Forgejo registry size / bandwidth limits depend on the instance.
**Verdict:** Probably the best fit for "we want it hosted alongside
the project."
#### Plain HTTP server on a VPS subdomain
A dedicated subdomain like `models.levineuwirth.org` backed by a
simple HTTP file server (nginx `autoindex`, or Caddy with a tidy
directory layout). Example URL:
```
https://models.levineuwirth.org/metrabs/metrabs_eff2l_y4_384px_800k_28ds.tar.gz
```
Pros:
- Simplest possible story. No API, no auth machinery.
- Easy to mirror from — anyone can curl the URL.
- Decoupled from the git forge, so we can share models publicly even
when the repo itself is private.
- Easy to put a CDN in front (Cloudflare) if bandwidth ever matters.
Cons:
- Manual upload via scp/rsync.
- No access control unless we add it.
- No versioning beyond filename convention.
**Verdict:** Strong candidate. This is probably the right choice for
v0.1 of self-hosted models.
#### S3-compatible object storage (MinIO self-hosted)
Run MinIO on the VPS, get S3-compatible API for free, and serve models
via pre-signed URLs or a public bucket.
Pros:
- Proper object storage with ETags, range requests, multipart uploads.
- Integration story is straightforward if we ever move to cloud-hosted
storage.
- Industry-standard API.
Cons:
- More operational complexity than a plain HTTP server for what might
be a handful of files.
**Verdict:** Overkill for v0.1 but worth revisiting if model storage
becomes a real operational concern.
### Integrity: SHA-256 pinning
Regardless of which hosting approach we pick, **the model loader should
always verify a SHA-256 checksum** before trusting the downloaded
artifact. This is the one piece of supply-chain hygiene that has to be
in place before we ship commit 11 to any user outside the Shu lab.
Implementation sketch for `neuropose/_model.py`:
```python
def load_metrabs_model(cache_dir: Path | None = None) -> Any:
    cache_dir = cache_dir or _default_model_cache_dir()
    cache_dir.mkdir(parents=True, exist_ok=True)
    tarball = cache_dir / _MODEL_FILENAME
    if not tarball.exists():
        _download(_MODEL_URL, tarball)
    _verify_sha256(tarball, _MODEL_SHA256)
    extracted = _extract_if_needed(tarball, cache_dir)
    return tfhub.load(str(extracted))  # or tf.saved_model.load
```
The `_MODEL_SHA256` constant is the source of truth; if it ever has
to change, the constant change is visible in the git diff and a human
reviews it.
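For completeness, a minimal `_verify_sha256` might look like this (streamed, since the artifact is ~2.2 GB; a sketch, not the shipped code):
```python
# Minimal streaming checksum verifier.
import hashlib
from pathlib import Path

def _verify_sha256(path: Path, expected: str) -> None:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            digest.update(chunk)
    if digest.hexdigest() != expected:
        raise RuntimeError(f"SHA-256 mismatch for {path.name}: refusing to unpack")
```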
### Fine-tuning
The next research direction after we have inference working is
fine-tuning MeTRAbs on clinical-specific data. Open questions:
- **What data?** Any clinical data is IRB-gated. Even de-identified
pose data may carry subject information if the recording conditions
(lighting, room layout) are distinctive enough. Any training plan
has to run through the data-handling policy that lives (will live)
in `docs/data-policy.md`.
- **Transfer learning strategy.**
- *Head-only fine-tuning*: freeze the EfficientNetV2-L backbone and
re-train the pose regression head on clinical data. Fast, low
compute, unlikely to overfit, but also unlikely to capture
clinical-pose idiosyncrasies (see the sketch after this list).
- *Low-LR full fine-tune*: unfreeze everything, use a learning rate
1/100th of the original, train for a few epochs. Better
adaptation, higher risk of catastrophic forgetting.
- *Adapter layers*: insert small trainable adapters into the frozen
backbone. Parameter-efficient, well-studied in NLP, less common
for pose but should work.
- **Compute requirements.** EfficientNetV2-L is roughly 120M parameters;
fine-tuning on a single modern GPU (24 GB VRAM) is feasible at
reduced batch size. A multi-GPU node is friendlier but not strictly
required.
- **Evaluation.** We need held-out clinical data with trusted ground
truth. MoCap-derived poses are the gold standard; marker-based MoCap
systems provide sub-millimeter accuracy at the cost of subject
instrumentation. The Shu lab's access to MoCap is the gating factor.
- **Sharing fine-tuned weights.** If we fine-tune on clinical data, the
resulting weights may encode subject information in ways that are
non-obvious and potentially IRB-relevant. Sharing fine-tuned weights
externally has to be cleared through the same channels as sharing the
training data.
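To make the head-only option concrete, a Keras-shaped sketch. Big assumption: a trainable re-export of the backbone exists; the pinned SavedModel exposes inference signatures only, so this illustrates the pattern with a stock Keras backbone as a stand-in rather than anything runnable against the shipped artifact:
```python
# Head-only fine-tuning pattern. The real MeTRAbs backbone would need a
# trainable re-export; EfficientNetV2L here is a stand-in.
import tensorflow as tf

backbone = tf.keras.applications.EfficientNetV2L(include_top=False, pooling="avg")
backbone.trainable = False                       # freeze the backbone
inputs = tf.keras.Input(shape=(384, 384, 3))
features = backbone(inputs, training=False)      # keep BatchNorm in inference mode
head = tf.keras.layers.Dense(43 * 3)(features)   # 43 joints x (x, y, z)
outputs = tf.keras.layers.Reshape((43, 3))(head)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
```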
### Training our own pose estimator
The long-range version of the research direction: train a pose
estimator from scratch that extends MeTRAbs's methodology. MeTRAbs is
a good starting point because the method is well-documented:
- Sárándi, I., et al. (2020). "MeTRAbs: Metric-Scale Truncation-Robust
Heatmaps for Absolute 3D Human Pose Estimation."
[arXiv 2007.07227](https://arxiv.org/abs/2007.07227),
IEEE Transactions on Biometrics, Behavior, and Identity Science.
Core contributions (worth knowing if you modify any of this):
- **Truncation-robust heatmaps.** Instead of predicting a 2D heatmap
bounded by the image, MeTRAbs predicts a heatmap that extends
*outside* the image and can place a joint at coordinates the image
alone could not disambiguate. Critical for crops where the subject
is partially out of frame.
- **Metric scale regression.** MeTRAbs predicts the absolute 3D
positions of joints in millimetres by combining a 2D heatmap with a
per-joint depth regressor. Most 3D pose methods produce only
relative coordinates, which are useless for clinical measurement.
- **Multi-dataset training with a common skeleton.** The 28-dataset
training set unifies disparate skeleton topologies into a common
43-joint Berkeley MHAD skeleton, which we carry forward in
NeuroPose.
**Natural extensions worth considering:**
- **Temporal smoothing head.** MeTRAbs is a per-frame model. Clinical
gait analysis wants temporally smooth trajectories. Adding a
lightweight temporal head (1D CNN or small transformer over frame
sequences) could produce smoother outputs without touching the
backbone (see the sketch after this list).
- **Clinical-specific heatmap supervision.** If we have MoCap data for
clinical poses, we can use it as ground-truth heatmap supervision to
improve accuracy in the pose ranges the model sees least often in
the 28-dataset training corpus (e.g., pathological gaits, walker-
assisted ambulation).
- **Multi-person identity tracking.** MeTRAbs produces detections per
frame without continuity across frames. Adding a Hungarian-matched
tracker (or a learned tracker) would solve the multi-person
identity problem that `predictions_to_numpy` currently dodges with
a `person_index` parameter.
- **Alternative backbones.** EfficientNetV2-L is a 2020-era choice.
Newer backbones (ConvNeXt, DINOv2-initialized ViTs) may give
meaningful gains, especially for clinical poses that are
under-represented in the original training set.
- **Uncertainty estimation.** Clinical users want to know when the
model is unsure. A Gaussian output head (mean + variance per joint)
or an ensemble-based approach would let us propagate uncertainty
into downstream analysis.
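The temporal-smoothing idea in particular is small enough to sketch (purely illustrative; nothing like this exists in NeuroPose yet):
```python
# Residual 1D-CNN smoother over per-frame pose vectors (frames, 43 * 3):
# the head learns corrections only, so an untrained head is near-identity.
import tensorflow as tf

def temporal_smoother(joints: int = 43) -> tf.keras.Model:
    seq = tf.keras.Input(shape=(None, joints * 3))   # variable-length clip
    x = tf.keras.layers.Conv1D(256, 5, padding="same", activation="relu")(seq)
    x = tf.keras.layers.Conv1D(joints * 3, 5, padding="same")(x)
    return tf.keras.Model(seq, seq + x)              # residual correction
```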
**Compute requirements:** training MeTRAbs from scratch was reported
as "a few weeks" on 8x V100 in the original paper. A from-scratch
re-training is a substantial undertaking. Fine-tuning is much more
accessible.
### Collaboration opportunities
- **István Sárándi** (now at University of Tübingen, formerly RWTH
Aachen) is the author of MeTRAbs. The code is MIT-licensed and he
has historically been responsive to collaboration requests. If we
end up publishing work that significantly extends MeTRAbs, at the
very least we should reach out about co-authorship or
acknowledgment; at best we might find an active collaborator.
- **The Shu Lab's existing collaborators** on clinical gait research
at Brown and partner institutions may have MoCap-validated datasets
we can use for fine-tuning and evaluation. Worth asking Dr. Shu.
### Open questions
1. Does Forgejo's generic package registry actually handle a 2.2 GB
upload cleanly, or do we need the plain HTTP server route?
2. What's the right SHA-256 pin to commit alongside the URL? (Need to
download the tarball first and run `sha256sum`.)
3. Do we have access to MoCap-validated clinical gait data for
fine-tuning evaluation? This gates every training-related
experiment.
4. Is fine-tuning even worth pursuing before we have inference results
that are clearly *not* good enough on clinical data? (I.e.,
motivate the work with concrete failure cases rather than assuming
a delta we haven't measured.)
5. Does it make sense to reach out to Sárándi now, or wait until we
have something concrete to collaborate on?
### Reading list
- Sárándi, I. et al. (2020). "MeTRAbs: Metric-Scale Truncation-Robust
Heatmaps for Absolute 3D Human Pose Estimation."
[arXiv 2007.07227](https://arxiv.org/abs/2007.07227). **Essential
reading** for anyone planning to extend the method.
- Sárándi's personal site and the MeTRAbs GitHub repo
(<https://github.com/isarandi/metrabs>) — the code, model zoo, and
training scripts live here.
- Zheng, C. et al. (2023). "Deep Learning-Based Human Pose Estimation: A
Survey." Good survey paper for orienting on the state of the art.
- The original 28-dataset training composition referenced in the
MeTRAbs paper — worth tracing through to understand what poses are
in- and out-of-distribution for the pretrained model.
### Next steps
- [ ] Download the pinned tarball and compute its SHA-256 for the
commit-11 model loader.
- [ ] Decide between Forgejo generic registry and plain HTTP subdomain
for self-hosting. Prototype whichever one wins.
- [ ] Mirror the pinned tarball to the chosen self-hosted location so
we can fall over to it the moment the RWTH URL changes or goes
down.
- [ ] Write a one-page "MODEL_ARTIFACTS.md" that documents every model
version we use, its checksum, and its canonical source URL.
- [ ] Have the data-access conversation with Dr. Shu about clinical
training data. Everything else is blocked on this.
- [ ] (Much later) Reach out to Sárándi about potential collaboration
once we have something concrete to talk about.

TECHNICAL.md: file diff suppressed because it is too large.