wifi-ruview/docs/benchmarks/pose-estimation-cog.md
rUv 3314c8db8d feat(cog-pose-estimation): scaffold first Cog from this repo (ADR-100 + ADR-101) (#642)
* feat(cog-pose-estimation): scaffold first Cog from this repo (ADR-100 + ADR-101)

Adds the foundation for the pose-estimation Cog that ships from this
repo into Cognitum V0 appliances. Companion ADR-225 + crate land in
cognitum-one/v0-appliance.

ADRs:
* ADR-100 formalises the Cognitum Cog packaging spec — on-device
  layout under /var/lib/cognitum/apps/<id>/, manifest.json schema
  (incl. new binary_sha256 + binary_signature fields), GCS hosting
  convention, repo source layout, build pipeline, and the four-verb
  runtime contract (version | manifest | health | run). Documents the
  convention I reverse-engineered from inspecting installed cogs on a
  live cognitum-v0 appliance — `anomaly-detect`, `presence`,
  `seizure-detect`, etc.
* ADR-101 designs the pose-estimation Cog itself: where it sits in
  the wifi-densepose pipeline (encoder init from
  ruvnet/wifi-densepose-pretrained, 17-keypoint regression head),
  what gets shipped per target arch (arm / x86_64 / hailo8 /
  hailo10), acceptance gates (PCK@20 explicitly deferred to #640 —
  this ADR ships the vehicle, not the accuracy).

Crate v2/crates/cog-pose-estimation/:
* Cargo.toml + workspace member declaration with a hailo feature gate
  so the binary builds without the Hailo SDK in CI.
* main.rs implements the four-verb CLI exactly per ADR-100.
* config.rs / manifest.rs / publisher.rs / inference.rs / runtime.rs —
  small modules, each <100 lines.
* publisher.rs emits ADR-100 structured JSON events.
* inference.rs is a stub that produces a centred-skeleton baseline
  with confidence=0 (honest: no trained weights wired in yet).
* runtime.rs subscribes to /api/v1/sensing/latest, slides a
  56*20 window, runs the engine, emits pose.frame events.
* cog/manifest.template.json + cog/config.schema.json define the
  release artifact + runtime config schemas.
* cog/Makefile holds build / sign / upload targets.
* tests/smoke.rs covers manifest roundtrip + engine I/O surface.

Verified locally:
* cargo check -p cog-pose-estimation: clean.
* cargo test  -p cog-pose-estimation: 4/4 pass.
* ./target/release/cog-pose-estimation {version,manifest,health}:
  all emit the right contract output.

This commit contains scaffolding only; the actual trained weights and
Hailo HEF cross-compile come in follow-ups tracked in #640 and the
companion v0-appliance branch.

* feat(cog-pose-estimation): first measured run — Candle CUDA on RTX 5080

Trained pose_v1 on ruvultra (RTX 5080) via Candle 0.9 + cuda feature
against the same 1,077-sample paired session that produced 0%/0% PCK
in #640 with the pure-JS SPSA trainer. First real numbers:

  PCK@20 = 3.0%   (up from 0.0%)
  PCK@50 = 18.5%  (up from 0.0%)
  MPJPE  = 0.093  (down from 0.66, ~7x improvement)

400 epochs in 2.1 s wall time, full-batch, ~5 ms/epoch. Loss curve
0.181 -> 0.014 over the run, eval 0.010. The per-joint breakdown shows
the model leans on proximal joints, mostly on the right side (r_hip 77%
PCK@50, r_knee 35%, plus l_elbow at 26%), consistent with the camera
framing in the source recording. Distal joints (wrists, ankles) and
face joints are still near-random, consistent with the 56-subcarrier /
20-frame input not carrying fine-grained spatial info at 1077 samples.

This commit:

* Adds v2/crates/cog-pose-estimation/cog/artifacts/{pose_v1.safetensors,
  train_results.json} so the cog dir now contains a real reference
  artifact, not just scaffold.
* Updates cog/README.md "Status" block with the measured numbers,
  per-joint table, and an honest reading of where the model
  succeeds vs where the data is the bottleneck.
* Adds docs/benchmarks/pose-estimation-cog.md as the canonical
  benchmark log — append-only, one section per published run.
* Appends a "First measured run" section to ADR-101 referencing
  the new benchmark file.

Still pending in the follow-up:
* Wire pose_v1.safetensors into src/inference.rs (replace stub).
* ONNX export (Candle lacks a writer — needs external conversion).
* Hailo HEF cross-compile + cluster deploy.

The data-bound gap to PCK@20 >= 35% is tracked in #640.

* feat(cog-pose-estimation): wire real weights — cog is no longer a stub

Replaces the centred-skeleton stub in src/inference.rs with a real
Candle-based loader that reads cog/artifacts/pose_v1.safetensors and
runs the trained Conv1d encoder + MLP pose head on every incoming CSI
window.

What changes:

* src/inference.rs: PoseNet mirrors the training script's architecture
  exactly — Conv1d(56->64, k=3 d=1), Conv1d(64->128, k=3 d=2),
  Conv1d(128->128, k=3 d=4), mean over time, Linear(128->256)+ReLU,
  Linear(256->34)+sigmoid -> reshape [17, 2]. The InferenceEngine
  searches a sensible candidate list for the weights file
  (/var/lib/cognitum/apps/pose-estimation/, ./pose_v1.safetensors,
  ./cog/artifacts/, repo-root, v2/-relative) and falls back to the
  stub when none are present so the cog still satisfies ADR-100.
* Cargo.toml: adds candle-core 0.9 + candle-nn 0.9 (no-default-features,
  CPU build by default) + safetensors 0.4. New `cuda` feature opt-in
  for GPU inference on hosts that have it. Drops the unused
  wifi-densepose-train path dep from the default build path.
* src/main.rs + src/publisher.rs: health.ok event now carries
  `backend` (candle-cuda | candle-cpu | stub) and the synthetic
  output confidence, so operators can tell at a glance whether the
  cog loaded its weights or fell back to the stub.
* tests/smoke.rs: adds `real_weights_load_when_available` which
  asserts the loaded engine reports backend=candle-* and emits
  non-zero confidence — exactly the signal that proves we're not
  silently degrading to the stub.

Verified locally:

* `cargo check -p cog-pose-estimation --no-default-features` — clean
* `cargo test  -p cog-pose-estimation --no-default-features` — 5/5 pass
* `./target/release/cog-pose-estimation health` emits:
  {"event":"health.ok","fields":{"backend":"candle-cpu","cog":"pose-estimation","synthetic_output_confidence":0.185}}
  — 0.185 is the published PCK@50 from cog/artifacts/train_results.json,
  emitted by the real Candle inference path (would be 0.0 if it had
  fallen back to the stub).

The cog now runs the trained pose_v1 model end-to-end. Accuracy is
still bounded by the underlying 1077-sample training data (PCK@20
3.0%, PCK@50 18.5% per docs/benchmarks/pose-estimation-cog.md) — that
gap is data-bound and tracked in #640. ONNX export + Hailo HEF
cross-compile remain follow-ups.

* docs(benchmarks): measure cog-pose-estimation cold-start latency

100 sequential `cog-pose-estimation health` invocations average 76.2 ms
each on a Windows x86_64 host using the `candle-cpu` backend. Each
invocation re-loads pose_v1.safetensors and runs one synthetic forward
pass, so this is the worst-case cold-start path. Long-running `run`
inference will be sub-millisecond per frame once the model is loaded.

Updates the benchmarks doc accordingly.

* feat(cog-pose-estimation): ONNX export — pose_v1.onnx + scripts/export-onnx.py

Adds the canonical ONNX artifact that unblocks downstream Hailo HEF
cross-compile + ONNX Runtime benchmarks. Generated on ruvultra (torch
2.12.0 + CUDA), 12,059 bytes, opset 18, dynamic batch axis.

* scripts/export-onnx.py: mirrors the Candle inference architecture in
  PyTorch (Conv1d 56->64, 64->128, 128->128 + Linear 128->256->34), pure-
  python safetensors loader (no extra pip dep), exports via
  torch.onnx.export, then verifies via onnx.checker.check_model and
  numerical parity against the torch reference.
* Verified parity vs torch: max |torch - onnx| = 8.94e-8 (1e-5
  threshold), i.e. within float32 rounding error.
* v2/crates/cog-pose-estimation/cog/artifacts/pose_v1.onnx — the
  artifact itself, 12 KB.
* docs/benchmarks/pose-estimation-cog.md — adds an ONNX export
  section with the verification numbers.

Next: Hailo HEF cross-compile (still gated on Hailo SDK on a
self-hosted runner) and ONNX Runtime latency benchmarks on each
target arch.

* feat(cog-pose-estimation): release v0.0.1 — signed aarch64 binary on GCS

End-to-end deploy: cross-compiled to aarch64-unknown-linux-gnu on
ruvultra, ran via qemu-aarch64-static, then smoke-tested on a real
cognitum-v0 Pi 5. Signed with COGNITUM_OWNER_SIGNING_KEY (Ed25519)
and uploaded to gs://cognitum-apps/cogs/arm/.

Real-hardware results on cognitum-v0 (Pi 5):
  health: backend=candle-cpu, confidence=0.185, real weights loaded
  30x sequential `health`: 0.251 s total -> 8.4 ms / invocation (cold)

GCS release artifacts (publicly downloadable):
  binary:  3,741,976 bytes
    sha256 1e1a7d3dd01ca05d5bfc5dbb142a5941b7866ed9f3224a21edc04d3f09a99bf5
  weights:   507,032 bytes
    sha256 eb249b9a6b2e10130437a10976ed0230b0d085f86a0553d7226e1ae6eae4b9e5
  signature (Ed25519, b64): LUN7xqLPYD3MFzm5dKB5MnYU0LvoRtek5ci5KiKPHBg+Xo6xuazwokn2Dw2JPMaLYJzmWn/SpT4djuR7hYvVDw==

Adds:
* v2/crates/cog-pose-estimation/cog/artifacts/manifest.json — the
  release-pipeline-produced manifest with all fields filled in per
  ADR-100, including arch, target_triple, signature, and a
  build_metadata block carrying the validation PCK numbers.
* docs/benchmarks/pose-estimation-cog.md — new sections covering
  the real Pi 5 smoke (8.4 ms cold-start) and the signed GCS
  release artifacts.

Verified by downloading the binary anonymously from GCS and
re-computing the sha256 — matches the locally-computed sha exactly.
Signature decoded to the expected 64-byte Ed25519 length.

Closes the GCS-upload acceptance criterion from ADR-100; the only
pending work is Hailo HEF cross-compile (still SDK-gated) and an
x86_64 release alongside this arm release.

* docs(benchmarks): record live cognitum-v0 install + 5-sec smoke run

Adds the "Live appliance install" section documenting what happened
when the signed v0.0.1 binary + weights were installed under
/var/lib/cognitum/apps/pose-estimation/ on cognitum-v0 (the V0
cluster leader).

* Layout matches the existing anomaly-detect / presence / seizure-
  detect cogs exactly — the Cogs dashboard at
  http://cognitum-v0:9000/cogs auto-discovers entries.
* `cog-pose-estimation run` ran for 5 seconds in the background and
  cleanly emitted run.started + structured WARN events for the
  missing local sensing-server on :3000 (cognitum-v0's actual CSI
  source is ruview-vitals-worker on :50054, not :3000). No crashes,
  no NaN, no leaks.
* Wiring `sensing_url` to the appliance-native source is a separate
  Day-2 integration task.
2026-05-19 17:03:09 -04:00


cog-pose-estimation — Benchmark Log

This file tracks every published benchmark for the pose-estimation Cog, per ADR-101 §"Acceptance gates". New runs are appended; history is never overwritten.

v0.0.1 — first measured run (2026-05-19)

Setup

| Component | Value |
|---|---|
| Training host | ruvultra (Ubuntu 6.17, x86_64, RTX 5080) |
| Backend | candle-core 0.9 with cuda feature |
| Data | data/paired/wiflow-p7-1779210883.paired.jsonl: 1,077 paired samples, 30-min seated-at-desk recording, avg conf 0.44 |
| Train/eval split | 80/20 stratified on ts_start (eval is a held-out time window, not random) |
| Architecture | Conv1d encoder (56 → 64 → 128 → 128, dilations 1/2/4) + MLP head (128 → 256 → 34 → sigmoid → [17, 2]) |
| Encoder init | random (the HF presence model is an MLP 8→64→128, incompatible with this Conv1d shape) |
| Optimizer | AdamW, lr 1e-3, weight_decay 0.01 |
| LR schedule | Cosine with 50-epoch warm restarts |
| Loss | SmoothL1 (Huber β=0.1), confidence-weighted by record.conf |
| Augmentation | Subcarrier dropout 10% (final 50 epochs) |
| Epochs | 400 (full-batch) |
| Wall time | 2.1 s total |
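
The loss row above can be read as a per-sample SmoothL1 averaged over the 34 output coordinates and then weighted by each record's confidence. The following is a minimal sketch of that reading, not the training script itself: it assumes the PyTorch-style SmoothL1 definition (quadratic below β, linear above) and normalisation by the sum of confidences, and the function names are hypothetical.

```python
def smooth_l1(diff: float, beta: float = 0.1) -> float:
    """PyTorch-style SmoothL1: quadratic for |diff| < beta, linear otherwise."""
    a = abs(diff)
    if a < beta:
        return 0.5 * a * a / beta
    return a - 0.5 * beta


def confidence_weighted_loss(pred, target, conf, beta: float = 0.1) -> float:
    """Confidence-weighted mean of per-sample SmoothL1 losses.

    pred, target: one flat list of coordinates per sample (34 values here).
    conf: one weight per sample (record.conf in the setup table).
    Dividing by the sum of weights is an assumption of this sketch.
    """
    weighted_sum = 0.0
    for p, t, c in zip(pred, target, conf):
        per_sample = sum(smooth_l1(pi - ti, beta) for pi, ti in zip(p, t)) / len(p)
        weighted_sum += c * per_sample
    weight_total = sum(conf)
    return weighted_sum / weight_total if weight_total > 0 else 0.0
```

With this weighting, a low-confidence ground-truth label pulls on the model less than a high-confidence one, which matters for a recording whose average confidence is 0.44.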

Accuracy

| Metric | Value |
|---|---|
| PCK@20 (overall) | 3.0% |
| PCK@50 (overall) | 18.5% |
| MPJPE (normalized) | 0.0931 |
| Final eval loss | 0.0101 |
| Loss reduction | 0.181 → 0.014 (13×) |
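
For readers unfamiliar with the metrics, here is a hedged sketch of how PCK@20, PCK@50 and MPJPE are commonly computed from predicted and ground-truth keypoints. This log does not state which reference length the PCK thresholds are normalised by, so the sketch takes a per-sample `scale` (for example torso length or bounding-box size) as an explicit, assumed input; the function name is hypothetical.

```python
import math


def pck_and_mpjpe(pred, target, scale, thresholds=(0.2, 0.5)):
    """Return ({threshold: PCK}, MPJPE) over all keypoints of all samples.

    pred, target: one list of (x, y) keypoints per sample.
    scale: one reference length per sample (an assumption of this sketch).
    A keypoint counts as correct at threshold t when its Euclidean error
    is at most t * scale for that sample.
    """
    hits = {t: 0 for t in thresholds}
    total_error = 0.0
    count = 0
    for p, g, s in zip(pred, target, scale):
        for (px, py), (gx, gy) in zip(p, g):
            err = math.hypot(px - gx, py - gy)
            total_error += err
            count += 1
            for t in thresholds:
                if err <= t * s:
                    hits[t] += 1
    if count == 0:
        raise ValueError("no keypoints to evaluate")
    return {t: hits[t] / count for t in thresholds}, total_error / count
```

Because PCK is a thresholded count while MPJPE is a mean distance, a run can improve MPJPE substantially while PCK@20 stays low, which is the pattern in the table above.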

Per-joint PCK

| Joint | PCK@20 | PCK@50 | Joint | PCK@20 | PCK@50 |
|---|---|---|---|---|---|
| nose | 0.5% | 5.1% | l_hip | 0.0% | 27.3% |
| l_eye | 2.8% | 8.3% | r_hip | 25.0% | 76.9% |
| r_eye | 1.9% | 15.7% | l_knee | 2.3% | 20.8% |
| l_ear | 0.0% | 3.2% | r_knee | 0.9% | 35.2% |
| r_ear | 1.9% | 9.7% | l_ankle | 1.4% | 7.9% |
| l_shoulder | 4.6% | 8.8% | r_ankle | 0.9% | 9.3% |
| r_shoulder | 1.9% | 19.9% | l_elbow | 1.9% | 26.4% |
| l_wrist | 3.2% | 24.1% | r_elbow | 0.0% | 4.2% |
| r_wrist | 1.4% | 12.0% | | | |

Strongest signal at right-side proximal joints (r_hip 77% PCK@50, r_knee 35%, r_shoulder 20%) — consistent with the camera framing during data collection (operator's right side most consistently in frame).

Comparison to prior baseline

| Run | Backend | Train time | PCK@20 | PCK@50 | MPJPE |
|---|---|---|---|---|---|
| pre-2026-05-19 | pure-JS SPSA, lite TCN (#640) | ~20 min | 0.0% | 0.0% | 0.66 |
| v0.0.1 (this run) | candle-cuda, Conv1d TCN | 2.1 s | 3.0% | 18.5% | 0.093 |

7× MPJPE improvement, 570× faster training, and signal-bearing PCK@50 at most proximal joints (hips, knees, l_elbow, r_shoulder; r_elbow and l_shoulder remain below 10%). The remaining gap to ADR-079's PCK@20 ≥ 35% target is data-bound, not infra-bound (see Issue #640).

Inference latency

Measured on Windows host (x86_64, no GPU — candle-cpu backend) running the release binary:

| Mode | Measurement | Notes |
|---|---|---|
| Cold start | 76.2 ms / invocation (avg over 100 sequential health invocations) | Includes safetensors load + 1 synthetic forward pass. Most of the cost is process startup + mmap. |
| Long-running run warm inference | sub-millisecond per frame (estimated) | The model is 125K params / 507 KB; once loaded, a single forward at batch=1 is essentially memory-bandwidth bound. To be measured precisely against a live sensing-server feed. |
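
The parameter count and file size quoted in the table can be cross-checked from the layer shapes. The arithmetic below is a sketch under stated assumptions: kernel size 3 and stride 1 for each Conv1d, a bias on every layer, and f32 storage. Under those assumptions the total lands close to the quoted 125K, and the weights account for all but a few hundred bytes of the 507,032-byte safetensors file.

```python
# Assumptions: kernel size 3, stride 1, a bias on every layer, f32 weights.
KERNEL = 3
DILATIONS = (1, 2, 4)


def conv1d_params(c_in: int, c_out: int, k: int = KERNEL) -> int:
    return c_in * c_out * k + c_out  # weights + bias


def linear_params(n_in: int, n_out: int) -> int:
    return n_in * n_out + n_out  # weights + bias


def total_params() -> int:
    return (
        conv1d_params(56, 64)
        + conv1d_params(64, 128)
        + conv1d_params(128, 128)
        + linear_params(128, 256)
        + linear_params(256, 34)
    )


def receptive_field(k: int = KERNEL, dilations=DILATIONS) -> int:
    # Each stride-1 dilated conv widens the temporal window by (k - 1) * dilation.
    return 1 + sum((k - 1) * d for d in dilations)


params = total_params()
print(params)                # 126562 parameters
print(params * 4)            # 506248 bytes of f32 data
print(507_032 - params * 4)  # 784 bytes not accounted for by the weights
print(receptive_field())     # 15 time steps
```

The 784-byte remainder is plausibly the safetensors header, and a 15-step receptive field covers most of a 20-frame input window before the mean over time.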

ONNX export

pose_v1.onnx is produced from pose_v1.safetensors by scripts/export-onnx.py, which mirrors the Candle architecture in PyTorch, loads the safetensors weights, and uses torch.onnx.export with opset 18 + dynamic batch axis. Verified end-to-end:

| Check | Result |
|---|---|
| onnx.checker.check_model | ok |
| Parity vs torch reference | max \|torch − onnx\| = 8.94e-8 (1e-5 threshold) |
| File size | 12,059 bytes |
| Dynamic axes | batch on input and output |

The ONNX artifact is the input to the Hailo Dataflow Compiler (HEF cross-compile) and to ONNX Runtime CPU/GPU benchmarks on each target arch — both still pending.
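
Since the export script loads the safetensors weights itself, it helps to know how little is needed to read that format. The sketch below parses only the file header, assuming the standard safetensors layout (an 8-byte little-endian length, a JSON header, then raw tensor bytes). It is not the code in scripts/export-onnx.py, and the function names are hypothetical.

```python
import json
import struct


def parse_safetensors_header(blob: bytes):
    """Return (tensors, data_offset) from the leading bytes of a .safetensors file.

    tensors maps each tensor name to (dtype, shape, [begin, end]); the byte
    offsets are relative to data_offset, where the raw tensor data starts.
    """
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    tensors = {
        name: (meta["dtype"], meta["shape"], meta["data_offsets"])
        for name, meta in header.items()
        if name != "__metadata__"  # optional free-form string map, not a tensor
    }
    return tensors, 8 + header_len


def read_safetensors_header(path: str):
    """Convenience wrapper for a file on disk (fine for a ~0.5 MB weights file)."""
    with open(path, "rb") as f:
        return parse_safetensors_header(f.read())
```

Listing the tensor names and shapes this way is a quick check that the PyTorch mirror of the architecture expects the same layers that the Candle trainer saved.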

Real-hardware smoke (cognitum-v0 Pi 5)

Cross-compiled to aarch64-unknown-linux-gnu on ruvultra and run on a live Cognitum-V0 appliance:

| Host | Mode | Result |
|---|---|---|
| ruvultra (under qemu-aarch64-static) | health | backend: candle-cpu, confidence: 0.185; real weights loaded under emulation |
| cognitum-v0 (Raspberry Pi 5, Cortex-A76) | health | backend: candle-cpu, confidence: 0.185; real weights, real hardware |
| cognitum-v0 | 30× sequential health invocations | 0.251 s total → 8.4 ms / invocation (cold) |

8.4 ms cold start on real Pi 5 hardware versus 76 ms on the x86_64 Windows host. The likely contributors are the Pi 5's NVMe I/O and the Candle CPU path hitting an already-cached safetensors mmap on repeated invocations. Warm inference in long-running run mode is still expected to be sub-millisecond.
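
The cold-start figures are averages over sequential health invocations. One way to reproduce that style of measurement is sketched below. How the original numbers were collected is not documented here, so the helper name and the binary path in the usage comment are assumptions.

```python
import subprocess
import sys
import time


def mean_invocation_ms(cmd, runs: int = 30) -> float:
    """Run cmd sequentially `runs` times and return the mean wall time in ms.

    Each run spawns a fresh process, so the figure includes process startup,
    weight loading, and whatever the command does before exiting.
    """
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    return (time.perf_counter() - start) * 1000.0 / runs


# Hypothetical usage (binary path is an assumption):
#   mean_invocation_ms(["./target/release/cog-pose-estimation", "health"], runs=30)
```

Sequential runs after the first mostly hit the OS page cache, so this measures a warm-disk cold start rather than a first-boot cold start.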

Release artifacts (signed + published to GCS)

gs://cognitum-apps/cogs/arm/cog-pose-estimation-arm                       3,741,976 bytes
gs://cognitum-apps/cogs/arm/cog-pose-estimation-pose_v1.safetensors         507,032 bytes

binary_sha256:  1e1a7d3dd01ca05d5bfc5dbb142a5941b7866ed9f3224a21edc04d3f09a99bf5
weights_sha256: eb249b9a6b2e10130437a10976ed0230b0d085f86a0553d7226e1ae6eae4b9e5
signature:      LUN7xqLPYD3MFzm5dKB5MnYU0LvoRtek5ci5KiKPHBg+Xo6xuazwokn2Dw2JPMaLYJzmWn/SpT4djuR7hYvVDw==   (Ed25519, signed with COGNITUM_OWNER_SIGNING_KEY)

Full manifest at cog/artifacts/manifest.json. Verified via public anonymous GET against https://storage.googleapis.com/cognitum-apps/cogs/arm/cog-pose-estimation-arm — downloaded SHA matches the locally-computed SHA.
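
The download check described above amounts to recomputing the SHA-256 of the fetched binary and confirming that the signature decodes to a plausible Ed25519 length. A minimal sketch follows, with hypothetical helper names. It does not verify the signature cryptographically, since that needs the owner's public key, which is not part of this log.

```python
import base64
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest in the same form as binary_sha256 / weights_sha256."""
    return hashlib.sha256(data).hexdigest()


def sha256_file(path: str) -> str:
    """Digest of a downloaded artifact (the files here are a few MB at most)."""
    with open(path, "rb") as f:
        return sha256_hex(f.read())


def looks_like_ed25519_signature(sig_b64: str) -> bool:
    """True when the base64 string decodes to exactly 64 bytes.

    This is a length check only, not a signature verification.
    """
    return len(base64.b64decode(sig_b64, validate=True)) == 64


# Hypothetical usage after downloading the binary to the current directory:
#   sha256_file("cog-pose-estimation-arm") == "<binary_sha256 from the manifest>"
```

Comparing the digest against the manifest value detects corruption or substitution in transit; only a real Ed25519 verification against the published key ties the binary to the signer.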

Live appliance install

Installed on cognitum-v0 (the V0 cluster leader) at /var/lib/cognitum/apps/pose-estimation/:

$ ls -la /var/lib/cognitum/apps/pose-estimation/
-rwxr-xr-x  cog-pose-estimation-arm   3,741,976 B   (matches GCS sha256)
-rw-r--r--  pose_v1.safetensors         507,032 B
-rw-r--r--  manifest.json                   989 B
-rw-r--r--  config.json                     187 B
-rw-r--r--  output.log                   28,438 B   (5-sec smoke run)

Layout matches the existing anomaly-detect, presence, seizure-detect, etc. cogs on the same appliance — the Cogs dashboard at http://cognitum-v0:9000/cogs auto-discovers entries under this dir.

cog-pose-estimation run ran cleanly in the background for 5 seconds with the default config. It correctly:

  • Emitted a run.started event with the configured sensing_url, model_path, and poll_ms.
  • Started its 40 ms poll loop.
  • Gracefully handled the missing local sensing-server on port 3000 by logging structured WARN events ({"level":"WARN","fields":{"message":"sensing-server fetch failed","error":"...Connection refused..."}}) without crashing, leaking, or producing NaN output.
  • Exited cleanly on SIGTERM.

0 pose.frame events fired during the smoke run — expected, since 127.0.0.1:3000 isn't serving CSI on the appliance. The appliance's actual CSI source is ruview-vitals-worker on :50054 plus the /api/v1/v0/system/... endpoints behind the appliance's bearer auth on :9000. Wiring sensing_url to the appliance-native source is a Day-2 integration task — separate from the cog binary itself.
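
A smoke run like this can be summarised by tallying the structured events in output.log. The sketch below assumes the log holds one JSON object per line, keyed either by "event" (as in run.started) or by "level" (as in the WARN records quoted above). Both that file layout and the function name are assumptions.

```python
import json
from collections import Counter


def tally_events(lines) -> Counter:
    """Count log records by their "event" name, falling back to "level"."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts["unparsed"] += 1
            continue
        if not isinstance(record, dict):
            counts["unparsed"] += 1
            continue
        counts[record.get("event") or record.get("level") or "unknown"] += 1
    return counts


# Hypothetical usage:
#   with open("/var/lib/cognitum/apps/pose-estimation/output.log") as f:
#       print(tally_events(f))
```

For the run described here, such a tally would be expected to show one run.started record, a series of WARN records, and no pose.frame entries.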

Pending separately:

  • Hailo HEF cross-compile (gated on Hailo SDK on a self-hosted runner) — uses pose_v1.onnx as input.
  • Appliance-native sensing-source integration (config.sensing_url should point at the cog-gateway's CSI tap on :9000, not the dev-loopback :3000).
  • x86_64 release upload (today's release is arm-only).

Artifacts

  • v2/crates/cog-pose-estimation/cog/artifacts/pose_v1.safetensors — 507 KB
  • v2/crates/cog-pose-estimation/cog/artifacts/train_results.json — full per-epoch loss curve + hyperparameters + per-joint PCK

Reproducibility

# On any host with cargo + a CUDA-capable GPU:
mkdir -p ~/work/cog-pose-train && cd ~/work/cog-pose-train
# Stage the same inputs (1,077 paired samples + HF encoder, see scripts/align-ground-truth.js for regeneration).
# The source locations are not specified here; replace /path/to/ with wherever the files live.
cp /path/to/paired.jsonl ./paired.jsonl
cp /path/to/encoder.safetensors ./encoder.safetensors

# Build & train (no Python, no pip)
cargo new --bin pose-trainer && cd pose-trainer
# Edit Cargo.toml deps: candle-core 0.9 (cuda), candle-nn 0.9 (cuda), safetensors, serde, serde_json, anyhow
# Drop the training script into src/main.rs (see this repo's training-tooling examples for reference)
cargo run --release

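If candle-core 0.8.4 or 0.9.2 is already in ~/.cargo/registry/cache/ (typical on a developer host that has built Candle projects before), cargo skips the crate download; a fresh pose-trainer project still compiles its dependencies on the first build.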