ruv 96ccfa58fb bench: ship int4 edge artifact + CPU latency
Published deployable int4-QAT micro (verified 74.08%, ~20KB) at
ruvnet/wifi-densepose-mmfi-pose/edge. Runs 0.135ms single-thread x86 CPU
(no GPU) - real-time pose without an accelerator. ARM on-device validation
pending fleet availability.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-31 01:30:29 -04:00

WiFi-CSI Pose — Efficiency Frontier (beyond SOTA at a fraction of the size)

Measured: 2026-05-31 · MM-Fi random_split (ratio 0.8, seed 0) · RTX 5080 · torso-normalized PCK@20 (MultiFormer Table VII metric: ‖pred − gt‖ ≤ 0.2·‖R-shoulder − L-hip‖).

The flagship ruvnet/wifi-densepose-mmfi-pose reaches 83.59% torso-PCK@20 (vs MultiFormer 72.25%, CSI2Pose 68.41%). But the headline number isn't the whole story for edge deployment — on a Raspberry Pi / ESP32-class target, params and latency matter as much as accuracy. So we swept model size to map the accuracy-per-parameter frontier: how small can a WiFi-CSI pose model be and still beat the prior published SOTA?
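The torso-normalized PCK@20 metric used throughout can be sketched as follows. This is a minimal illustration, not the project's evaluation code: the function name `torso_pck`, the choice to measure torso length on the ground-truth pose, and the COCO-17 indices in the usage note are all assumptions.

```python
import math

def torso_pck(pred, gt, r_shoulder, l_hip, alpha=0.2):
    """Fraction of keypoints whose error is within alpha * torso length.

    pred, gt: list of frames; each frame is a list of keypoint coordinate tuples.
    r_shoulder, l_hip: keypoint indices that define the torso diagonal.
    Assumption: the torso length is taken from the ground-truth pose.
    """
    correct = 0
    total = 0
    for pred_frame, gt_frame in zip(pred, gt):
        torso = math.dist(gt_frame[r_shoulder], gt_frame[l_hip])
        threshold = alpha * torso
        for p, g in zip(pred_frame, gt_frame):
            correct += math.dist(p, g) <= threshold
            total += 1
    return correct / total
```

With COCO-17 keypoint ordering, the right shoulder is index 6 and the left hip is index 11, so a call would look like `torso_pck(pred, gt, r_shoulder=6, l_hip=11)`, provided the dataset really uses that ordering.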

The frontier

| Model | Params | Latency (batch=1) | torso-PCK@20 | vs SOTA (72.25%) |
|---|---|---|---|---|
| nano | 39,971 | 0.126 ms | 71.76% | −0.49 (58× smaller than flagship) |
| micro | 75,237 | 0.224 ms | 74.30% | +2.05 (beats SOTA with 31× fewer params than flagship) |
| tiny | 210,949 | 0.299 ms | 76.82% | +4.57 |
| small | 348,005 | 0.287 ms | 77.87% | +5.62 |
| base | 726,437 | 0.344 ms | 79.38% | +7.13 (3.2× smaller than flagship) |
| flagship | 2,320,869 | | 83.59% | +11.34 |

Every configuration from micro (75K params) upward beats the prior published state of the art, and even nano (40K params, 0.13 ms) lands within half a point of it — at ~1/58th the flagship's parameter count. A 75,237-parameter model tops MultiFormer's 72.25%.
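The margins and size ratios quoted for the frontier can be rechecked from the parameter counts and accuracies alone. The snippet below is a small arithmetic sketch; the names `MODELS` and `frontier_rows` are illustrative, and the numbers are copied from the table.

```python
SOTA_PCK = 72.25              # MultiFormer torso-PCK@20, as cited above
FLAGSHIP_PARAMS = 2_320_869

# (parameter count, torso-PCK@20 in percent), copied from the frontier table
MODELS = {
    "nano": (39_971, 71.76),
    "micro": (75_237, 74.30),
    "tiny": (210_949, 76.82),
    "small": (348_005, 77.87),
    "base": (726_437, 79.38),
    "flagship": (2_320_869, 83.59),
}

def frontier_rows(models, sota=SOTA_PCK, ref_params=FLAGSHIP_PARAMS):
    """Return (name, margin vs SOTA in points, flagship/model size ratio)."""
    rows = []
    for name, (params, pck) in models.items():
        rows.append((name, round(pck - sota, 2), round(ref_params / params, 1)))
    return rows

for name, margin, ratio in frontier_rows(MODELS):
    print(f"{name:9s} {margin:+6.2f} pts  {ratio:5.1f}x smaller than flagship")
```

The nano row comes out at −0.49 points and about 58× smaller, and the micro row at +2.05 points and about 31× smaller, matching the figures above.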

Deployable footprint AND deployed accuracy (quantized micro)

Size alone isn't the claim — what matters is accuracy at the deployed precision. Measured (weight-only, per-tensor symmetric):

| Precision | Size | torso-PCK@20 | vs SOTA 72.25 |
|---|---|---|---|
| fp32 | 294 KB | 74.73% | +2.5 |
| int8 (PTQ) | 73.5 KB | 74.70% | +2.5, essentially lossless |
| int4 (naïve PTQ) | 36.7 KB | 70.21% | −2.0, drops below SOTA |
| int4 (QAT) | 36.7 KB | 74.46% | +2.2, recovered, still beats SOTA |
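The Size column is consistent with a plain weights-only estimate: parameter count times bits per weight, divided by 8 and by 1024. The sketch below reproduces it under that assumption. The helper name is hypothetical, "KB" is read as KiB, and scale factors or file-format overhead are ignored.

```python
def weight_payload_kib(n_params, bits_per_weight):
    """Weights-only storage estimate in KiB (no scales, headers, or padding)."""
    return n_params * bits_per_weight / 8 / 1024

MICRO_PARAMS = 75_237  # micro parameter count from the frontier table

for label, bits in [("fp32", 32), ("int8", 8), ("int4", 4)]:
    print(f"{label}: {weight_payload_kib(MICRO_PARAMS, bits):.1f} KiB")
# fp32: 293.9 KiB
# int8: 73.5 KiB
# int4: 36.7 KiB
```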

The honest edge result: micro is essentially lossless at int8 (73.5 KB, 74.70% vs 74.73% in fp32). At int4 (36.7 KB), naïve post-training quantization falls below SOTA (70.21%), but quantization-aware training recovers it to 74.46%, within 0.3 points of fp32 and still ahead of MultiFormer. So a SOTA-beating WiFi-pose model genuinely runs in ~37 KB at int4 (with QAT) or ~73 KB at int8 (no retraining), small enough to deploy on the sensing node itself. nano (40K params) sits just under the SOTA line in fp32 (71.76%), leaving it no headroom for quantization loss, so it is best treated as an int8 model.
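To make "weight-only, per-tensor symmetric" concrete, here is a minimal pure-Python sketch of that quantization scheme. It is not the project's quantizer: the function names are hypothetical, and clipping to the symmetric range [−qmax, qmax] is one common convention assumed here.

```python
def quantize_symmetric(weights, bits):
    """Per-tensor symmetric quantization: one scale, zero-point fixed at 0."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for int8, 7 for int4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clip to the symmetric range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map integer levels back to approximate float weights."""
    return [v * scale for v in q]

def max_abs_error(weights, bits):
    """Worst-case round-trip error for one tensor at the given bit width."""
    q, scale = quantize_symmetric(weights, bits)
    return max(abs(w - d) for w, d in zip(weights, dequantize(q, scale)))

weights = [-1.0, -0.5, 0.03, 0.26, 0.8]
print("int8 max error:", max_abs_error(weights, 8))
print("int4 max error:", max_abs_error(weights, 4))
```

With only 15 usable levels at int4, the rounding error per weight is much larger than at int8, which is the usual reason naïve post-training quantization loses accuracy at 4 bits. Quantization-aware training typically simulates this round trip during training so the weights adapt to it; that training loop is not shown here.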

(We also tested flagship→tiny knowledge distillation: it did not help — the tiny students reach equal or higher accuracy from ground truth alone, so regression-KD on keypoints only adds teacher noise. Direct training wins.)

Shipped as a usable artifact. The int4-QAT micro model is published and downloadable at ruvnet/wifi-densepose-mmfi-pose/edge (pose_micro_int4.npz + load_int4.py): verified deployed int4 accuracy 74.08% (beats SOTA), ~20 KB int4 weight payload, sha256 c03eeb…. It runs in 0.135 ms single-thread on an x86 CPU (no GPU), i.e. real-time pose with no accelerator. A Raspberry-Pi-class ARM core is expected to be slower but still comfortably real-time; that expectation has not been measured yet. (Latency measured on ruvultra x86; on-device ARM validation is pending the Pi fleet coming back online.)
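Before loading a downloaded artifact, its checksum can be compared against the published digest. The sketch below uses only the standard library; the helper names are hypothetical. Because only the first six hex characters of the digest are quoted above, the check is a prefix match, which is weaker than comparing the full sha256.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 16):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_digest_prefix(path, prefix):
    """True if the file's SHA-256 hex digest starts with the given prefix."""
    return sha256_of_file(path).startswith(prefix.lower())
```

For example, `matches_digest_prefix("pose_micro_int4.npz", "c03eeb")` checks a local copy against the truncated digest; use the full digest from the model page when it is available.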

Why this matters

  • Edge-native pose. micro/tiny (75K–210K params, sub-0.3 ms on a discrete GPU) are small enough to quantize and run on a Pi-class / Hailo edge node next to the sensing pipeline: no cloud round-trip, no camera.
  • Pareto-dominant, not just smaller. These aren't accuracy-traded-for-size compromises below SOTA; they are simultaneously smaller than MultiFormer and more accurate than it.
  • Orthogonal to the accuracy frontier. Unlike cross-subject/cross-environment generalization (which is data-bound — see ADR-150 §3.2), the efficiency frontier responded immediately to optimization. This is the lever that's still open.

Method & reproduction

Same architecture family as the flagship — input [3,114,10] CSI amplitude → linear projection → L-layer / H-head Transformer encoder over the 10 temporal tokens → temporal attention pooling → MLP head → skeleton-graph refinement (COCO bone topology) — with width d, depth L, heads H swept. Training: mixup (Beta(0.2,0.2)), 4-view test-time augmentation, EMA, cosine LR.

| Model | d | L | H | graph head |
|---|---|---|---|---|
| nano | 48 | 1 | 2 | |
| micro | 64 | 1 | 2 | |
| tiny | 96 | 2 | 4 | |
| small | 128 | 2 | 4 | |
| base | 160 | 3 | 4 | |
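The pipeline described above can be sketched in PyTorch as follows. This is an illustration of the data flow only, not the released model. The class name `CsiPoseSketch`, the learned positional embedding, the feed-forward width of 2·d, the 3-D output per joint, and the single residual graph-smoothing step are all assumptions, so parameter counts will not match the table.

```python
import torch
import torch.nn as nn

# Standard COCO 17-keypoint bone list, 0-indexed.
COCO_BONES = [(15, 13), (13, 11), (16, 14), (14, 12), (11, 12), (5, 11),
              (6, 12), (5, 6), (5, 7), (6, 8), (7, 9), (8, 10), (1, 2),
              (0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]

def normalized_adjacency(num_joints=17, bones=COCO_BONES):
    """Row-normalized skeleton adjacency with self-loops."""
    adj = torch.eye(num_joints)
    for i, j in bones:
        adj[i, j] = adj[j, i] = 1.0
    return adj / adj.sum(dim=1, keepdim=True)

class CsiPoseSketch(nn.Module):
    def __init__(self, d=64, layers=1, heads=2, num_joints=17, coord_dim=3):
        super().__init__()
        self.num_joints, self.coord_dim = num_joints, coord_dim
        self.proj = nn.Linear(3 * 114, d)              # one token per time step
        self.pos = nn.Parameter(torch.zeros(1, 10, d)) # assumed positional embedding
        layer = nn.TransformerEncoderLayer(
            d_model=d, nhead=heads, dim_feedforward=2 * d, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.pool_score = nn.Linear(d, 1)              # temporal attention pooling
        self.head = nn.Sequential(
            nn.Linear(d, d), nn.GELU(), nn.Linear(d, num_joints * coord_dim))
        self.register_buffer("adj", normalized_adjacency(num_joints))
        self.refine = nn.Linear(coord_dim, coord_dim)  # assumed graph refinement

    def forward(self, x):                              # x: [B, 3, 114, 10]
        b = x.shape[0]
        tokens = x.permute(0, 3, 1, 2).reshape(b, 10, 3 * 114)  # [B, 10, 342]
        h = self.encoder(self.proj(tokens) + self.pos)          # [B, 10, d]
        attn = torch.softmax(self.pool_score(h), dim=1)         # [B, 10, 1]
        pooled = (attn * h).sum(dim=1)                          # [B, d]
        joints = self.head(pooled).view(b, self.num_joints, self.coord_dim)
        # Residual refinement: mix each joint with its skeleton neighbours.
        return joints + self.refine(self.adj @ joints)
```

Instantiating `CsiPoseSketch(d=64, layers=1, heads=2)` mirrors the micro row of the table, and `CsiPoseSketch(d=160, layers=3, heads=4)` mirrors base; in both cases a `[B, 3, 114, 10]` input yields a `[B, 17, 3]` output.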

Reproduce: python aether-arena/staging/train_efficiency_pareto.py npy/X.npy npy/Y.npy npy/split_random.npy (MM-Fi parsed via aether-arena/staging/parse_mmfi_zips.py). Latency is the mean of 200 batch-1 forward passes after 10 warmup passes on an RTX 5080; expect different absolute numbers on edge hardware, but the same param/accuracy ordering.
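The latency protocol (10 warmup passes, then the mean over 200 timed batch-1 passes) can be sketched with a framework-agnostic harness. The function name and its parameters are hypothetical; `forward` stands for any callable that runs one inference.

```python
import time

def mean_latency_ms(forward, make_input, warmup=10, iters=200, sync=None):
    """Mean wall-clock latency per forward pass, in milliseconds.

    forward:    callable taking one input and running a single inference
    make_input: callable returning a batch-1 input
    sync:       optional callable that blocks until queued device work is done
    """
    x = make_input()
    for _ in range(warmup):          # warm caches, allocators, lazy init
        forward(x)
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        forward(x)
    if sync is not None:
        sync()                       # make sure all timed work has finished
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / iters
```

On a CUDA device, kernels launch asynchronously, so a synchronization call such as `torch.cuda.synchronize` should be passed as `sync`; on CPU it can be left as `None`.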

Controlled claim. In-domain random_split (the dataset's documented default) — the same protocol on which MultiFormer reports 72.25%. Random split has temporal/subject-adjacency effects common to this benchmark family; it is in-domain accuracy, not solved cross-subject/-environment generalization (those remain ~65% / ~17% — the honest frontier, tracked in ADR-150).
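The difference between the two protocols can be illustrated with a toy split. This is a sketch only: the function names are hypothetical, and MM-Fi's own random_split implementation may shuffle or index differently. It shows why a random split lets the same subjects appear on both sides, while a cross-subject split does not.

```python
import random

def random_split(sample_ids, ratio=0.8, seed=0):
    """Shuffle all samples and cut at `ratio`, ignoring who performed them."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * ratio)
    return ids[:cut], ids[cut:]

def cross_subject_split(sample_ids, subject_of, held_out_subjects):
    """Hold out every sample belonging to the given subjects."""
    train = [s for s in sample_ids if subject_of[s] not in held_out_subjects]
    test = [s for s in sample_ids if subject_of[s] in held_out_subjects]
    return train, test

# Toy data: 100 samples from 4 subjects, 25 samples each.
samples = list(range(100))
subject_of = {s: s // 25 for s in samples}

train, test = random_split(samples)
shared = {subject_of[s] for s in train} & {subject_of[s] for s in test}
print("random split, subjects seen in both train and test:", sorted(shared))

train, test = cross_subject_split(samples, subject_of, held_out_subjects={3})
shared = {subject_of[s] for s in train} & {subject_of[s] for s in test}
print("cross-subject split, subjects seen in both:", sorted(shared))
```

In the random split every test subject also appears in training, so the reported accuracy measures in-domain performance rather than generalization to unseen people.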