AetherArena ("AA") — the official, project-agnostic Spatial-Intelligence Benchmark
(ADR-149, Accepted). Iteration 1 of the long-horizon build:
- ADR-149 accepted: name locked (ruvnet/aether-arena), v0 metrics locked
(pose/presence/latency/determinism), dataset legality resolved (MM-Fi CC BY-NC
only; Wi-Pose excluded). Adds four-part framing, threat model, arena_score
formula, submission state machine, neutrality/governance, and the §7 acceptance test.
- aa_score_runner: deterministic scorer bin reusing the real ruview_metrics pose
harness on a fixed seed=42 fixture → RuViewTier-style verdict + cross-platform
SHA-256 proof hash. Builds --no-default-features (no torch/GPU). VERDICT: PASS.
- CI harness gate: .github/workflows/aether-arena-harness.yml runs the scorer on
every PR — the "PR that runs the harness as part of the build" requirement.
- Scaffold: aether-arena/{README,VERIFY,STATUS}.md + schema/aa-submission.toml.
- Horizon record persisted (.claude-flow/horizons/aether-arena-aa.json).
Infra = the deliverable; model SOTA (MM-Fi PCK@20) is a separate effort blocked on
ADR-079 data collection, tracked as a stretch goal, not an infra exit.
Co-Authored-By: claude-flow <ruv@ruv.net>
3.5 KiB
AetherArena ("AA") — The Official Spatial-Intelligence Benchmark
Public leaderboard. Private evaluation split. Open scorer. Signed results.
AetherArena is a standalone, project-agnostic benchmark for camera-free spatial intelligence — pose, presence, occupancy, tracking, and vitals from RF/WiFi (and, over time, mmWave / UWB / radar / lidar / multimodal). It is not a single-vendor leaderboard: any team, framework, or sensing modality can enter, and every entrant — including the RuView baseline that donated the seed scorer — is scored by the identical, open, pinned harness.
Specified in ADR-149 (Accepted).
Canonical home: ruvnet/aether-arena + a Hugging Face Space (deploy pending — see STATUS).
Why
WiFi/RF spatial sensing has no shared yardstick — papers self-report against inconsistent splits and metrics, with no accounting for latency, reproducibility, or privacy leakage. AA fixes the measurement, not just the models: a single deterministic scorer, a private held-out split nobody can train on, and a signed result ledger that can't be silently edited.
What gets measured (v0)
| Category | Metric | Status |
|---|---|---|
| Pose | PCK@0.2 (all / torso), OKS | Ranked |
| Presence | accuracy, FP/FN | Ranked |
| Edge latency | p50 / p95 / p99 ms | Ranked |
| Determinism | proof-hash pass/fail | Ranked (gate) |
| Tracking (MOTA) | — | activates when multi-person clips land |
| Vitals (BPM err) | — | activates when paired vitals ground truth lands |
| Privacy leakage | membership-inference ∈ [0,1] | gated — not ranked until the attacker ships |
| Cross-room | degradation ratio | coming soon |
The headline rank is the category metric; an optional arena_score = quality × latency_factor × privacy_factor × determinism_gate is exposed alongside (never instead) so accuracy can't win at any cost. See ADR-149 §2.5.
How scoring works
The scorer is RuView's already-published wifi-densepose-train acceptance harness (ruview_metrics + ADR-145 ablation), run in a pinned sandbox. You submit a model, not predictions — predictions on data you hold prove nothing. Your model is scored against a private MM-Fi held-out split (CC BY-NC 4.0; Wi-Pose excluded for redistribution reasons), and one signed, append-only row is written to the results ledger with a determinism proof hash.
Submission lifecycle: submitted → validated → quarantined → smoke_scored → full_scored → published (or rejected with a reason). The model only ever runs inside a no-network, read-only-FS sandbox.
Submit (when the Space is live)
- Write a manifest:
schema/aa-submission.toml. - Push your model artifact (
.safetensors/.rvf/ LoRA adapter) + manifest to the Space. - Watch it move through the lifecycle; your signed row appears on the board.
Verify it's fair (you don't have to trust us)
See VERIFY.md — run the open scorer locally on the public smoke split, reproduce the determinism hash, and confirm RuView's own entries were scored by the identical path. That five-step check is the launch gate (ADR-149 §7).
Neutrality
AA is a neutral commons. The scorer is open and versioned; any metric change is a public harness_version bump that re-scores all entries. RuView donated the seed harness and enters as one baseline — it gets no special treatment (ADR-149 §2.8).