# Deployment

This page covers running NeuroPose in production — on a research server, in a container, or as a managed system service. The target audience is whoever is actually setting up the pipeline for a study.

!!! warning "Data handling policy"
    Before deploying NeuroPose against subject data, read the (pending) `docs/data-policy.md` — it describes the IRB constraints on retention, sharing, and derived-data handling. If you are reading this before the data policy has landed, pause and ask the project lead before proceeding.

## Choosing a deployment mode

| Mode | Use when | Notes |
| --- | --- | --- |
| Local (bare) | Developer machine, one-off experiments | Fastest feedback loop. Use `neuropose process`. |
| Systemd service | Single-host lab server | Recommended for study runs. Auto-restart, log capture, clean shutdown. |
| Docker | Shared infra, CI pipelines, reproducible runs | Image build is pending commit 12. |
| Kubernetes | Multi-study labs with shared GPU pools | Not currently supported; would layer on top of the Docker image. |

## Local (bare-metal)

For one-off processing, the CLI is enough:

```bash
neuropose --config ./config.yaml process path/to/video.mp4
```

For batch mode, run the daemon in a tmux or screen session:

```bash
tmux new -s neuropose
neuropose --config ./config.yaml --verbose watch
# Ctrl-B D to detach
```

## Systemd user service

A systemd user unit (not a root-privileged one) is the right way to run the daemon on a research server where the researcher owns the job queue.

Create `~/.config/systemd/user/neuropose.service`:

```ini
[Unit]
Description=NeuroPose job daemon
After=network-online.target

[Service]
Type=simple
WorkingDirectory=%h/neuropose
Environment=XDG_DATA_HOME=%h/.local/share
ExecStart=%h/neuropose/.venv/bin/neuropose --config %h/neuropose/config.yaml watch
Restart=on-failure
RestartSec=10

[Install]
WantedBy=default.target
```

Enable it:

```bash
systemctl --user daemon-reload
systemctl --user enable --now neuropose.service
journalctl --user -u neuropose.service -f
```
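One systemd detail worth knowing: user services are normally stopped when the owning user's last session ends. If the daemon should keep running after you log out, enable lingering for the account (standard systemd behaviour, nothing NeuroPose-specific):

```bash
loginctl enable-linger "$USER"
```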

The interfacer's `fcntl`-based lock file prevents a second daemon from starting if systemd restarts the service before the first instance has fully released the lock.

## Docker

Pending commit 12. The plan is to ship two Dockerfiles:

- `Dockerfile` — CPU base, suitable for small studies.
- `Dockerfile.gpu` — CUDA base derived from `tensorflow/tensorflow:<pinned>-gpu`.

Both images will have the `neuropose` command as their `ENTRYPOINT`, so they can be invoked as:

```bash
docker run --rm \
  -v /srv/neuropose:/data \
  -e NEUROPOSE_DATA_DIR=/data/jobs \
  -e NEUROPOSE_MODEL_CACHE_DIR=/data/models \
  ghcr.io/.../neuropose:latest \
  watch
```
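If the GPU image lands as planned, containers will also need to see the host GPUs through the NVIDIA container runtime. With a current Docker plus nvidia-container-toolkit that is typically one extra flag; a sketch, reusing the placeholder image name from above:

```bash
docker run --rm --gpus all \
  -v /srv/neuropose:/data \
  -e NEUROPOSE_DATA_DIR=/data/jobs \
  -e NEUROPOSE_MODEL_CACHE_DIR=/data/models \
  ghcr.io/.../neuropose:latest \
  watch
```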

## GPU considerations

- NeuroPose delegates device selection to TensorFlow via the `device` field in `Settings` (`"/CPU:0"` or `"/GPU:0"`). There is no multi-GPU dispatch yet; a single daemon instance uses a single device.
- If you need to run inference on multiple GPUs in parallel, run one daemon per GPU with distinct `data_dir` values and divide jobs between them (see the sketch after this list). The fcntl lock is keyed on the data directory, so separate daemons on separate data dirs do not conflict.
- The first call to `Estimator.process_video` triggers the MeTRAbs model load, which in turn initializes the TensorFlow GPU runtime. Expect a one-time startup delay of several seconds.
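A minimal sketch of the one-daemon-per-GPU pattern. The config file names are placeholders; each config points at its own `data_dir` and keeps the `device` field at `"/GPU:0"`, while `CUDA_VISIBLE_DEVICES` (a CUDA/TensorFlow mechanism, not a NeuroPose option) restricts which physical GPU each process can see:

```bash
# Daemon for GPU 0: gpu0.yaml points at its own data_dir
CUDA_VISIBLE_DEVICES=0 neuropose --config ./gpu0.yaml watch &

# Daemon for GPU 1: gpu1.yaml points at a different data_dir
CUDA_VISIBLE_DEVICES=1 neuropose --config ./gpu1.yaml watch &
```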

## Log management

The daemon writes to stdlib logging. Under systemd, logs land in the user journal. For other deployment modes, redirect stdout/stderr to your log collector of choice — NeuroPose writes one line per event with a structured `%(asctime)s %(levelname)-8s %(name)s: %(message)s` format, which any log aggregator can parse.
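For a bare-metal run outside systemd, one low-effort option is to pipe the daemon's output into the system log via `logger(1)`; a sketch (the tag is arbitrary):

```bash
neuropose --config ./config.yaml watch 2>&1 | logger -t neuropose
```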

Log verbosity is controlled via the CLI:

```bash
neuropose --verbose watch   # DEBUG
neuropose watch             # INFO (default)
neuropose --quiet watch     # WARNING
```

## Monitoring

The canonical state of the daemon lives in `$data_dir/out/status.json`, a JSON object keyed by job name. A tiny Prometheus exporter or a nightly cron that tails the file is enough to alert on stuck jobs. A richer monitoring story is out of scope for v0.1.
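As a sketch of the nightly-cron idea: the per-job schema inside `status.json` is defined by the daemon, so treat the field name below as an assumption and check a real file on your install before wiring up alerts.

```bash
# Hypothetical check: prints the names of jobs whose entry is marked as failed.
# "$DATA_DIR" stands for your configured data_dir; the "state" field name is an
# assumption, adjust it to match the actual status.json schema.
jq -r 'to_entries[] | select(.value.state == "failed") | .key' "$DATA_DIR/out/status.json"
```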

## Backups and retention

Two things are worth backing up:

1. `$data_dir/out/*/results.json` — the aggregated predictions for each job. These are the outputs of the research process.
2. `$data_dir/out/status.json` — the daemon's record of which jobs ran when, which failed, and why.
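A minimal sketch of a copy step that picks up only these two artifacts (the source path and destination are placeholders; adapt them to wherever your `data_dir` lives):

```bash
rsync -av --prune-empty-dirs \
  --include='out/' \
  --include='out/status.json' \
  --include='out/*/' \
  --include='out/*/results.json' \
  --exclude='*' \
  /srv/neuropose/ backup-host:/srv/neuropose-backup/
```

Because everything else falls under the trailing `--exclude='*'`, the `in/` and `failed/` directories are never traversed, which matters for the policy below.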

Do not back up `$data_dir/in/` or `$data_dir/failed/` indiscriminately. These contain source video files that may be IRB-protected subject data, and your backup store may not be covered by the same data-handling agreement as the primary server. Consult the (pending) `docs/data-policy.md` before designing a retention plan.