Files
wifi-ruview/docs/adr/ADR-150-rf-foundation-encoder.md
T
ruv dac40e5df2 docs(adr-150): calibration thesis is task-general (action recognition)
Verified on a 2nd MM-Fi task: 27-class action recognition (which MM-Fi
never benchmarked for WiFi; only published baseline WiDistill 34%). In-domain
88% (leaky); cross-subject zero-shot collapses to ~10%; few-shot calibration
rescues 10->76% (1000 samples). Same mechanism as pose -> few-shot in-room
calibration is the universal WiFi-sensing generalization answer, not a pose
quirk.

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-31 03:01:50 -04:00

16 KiB
Raw Blame History

ADR-150: RuView RF Foundation Encoder — pose-preserving, subject/room/device-invariant CSI embedding

Field Value
Status Proposed
Date 2026-05-30
Deciders ruv
Codebase target New wifi-densepose-rfencoder (or nn/src/rf_foundation.rs) + training in wifi-densepose-train; consumed by the MM-Fi pose head and the AetherArena Generalization Track (ADR-149)
Relates to ADR-024 (Contrastive CSI Embedding / AETHER), ADR-027 (Cross-Environment Domain Generalization / MERIDIAN), ADR-134 (CIR), ADR-135 (calibration + coherence gate), ADR-145 (Ablation/Eval Harness), ADR-149 (AetherArena benchmark)

1. Context

AetherArena now has a published, metric- and protocol-matched MM-Fi result: 81.63% torso-PCK@20 in-domain (random_split), exceeding MultiFormer's 72.25% (#876). But the leakage-free cross-subject number collapses to ~11.6% torso-PCK (27% under the looser bbox metric). That gap is the real deployment frontier — homes, elder care, festivals, unseen bodies.

Naïve fixes already tested and failed: a subject-adversarial (DANN) embedding did not move cross-subject (baseline 27.26% → DANN 27.54% bbox; torso 11.57%). Bigger capacity hurt (transformer cross-subject 24.8% < conv 27.3%) — extra parameters overfit seen subjects.

Conclusion: a generic "better feature vector" will not help. The lever is an embedding trained for the right invariance — one that preserves pose while removing subject, room, and device signatures, and that exposes channel instability rather than hiding it.

1.1 Why DANN failed (and the corrected rule)

Subject identity is partly entangled with valid pose evidence — body scale, limb proportions, gait, RF scattering. Blindly erasing subject info also erases information the pose decoder needs. The corrected rule:

Remove subject identity only after preserving pose geometry. Supervised pose-contrast across subjects beats naïve adversarial identity removal.

The frontier objective is not same-subject = positive. It is:

same pose across different subjects = positive; different pose = negative.

2. Decision

Build the RuView RF Foundation Encoder: a self-supervised, pose-preserving, subject/room/device-invariant RF representation for CSI (extensible to CIR, ADR-134, and BFLD). Positioned as a platform primitive, not a benchmark trick.

2.1 What the embedding must keep / remove

Signal Action Why
Pose geometry Keep target signal
Limb-motion deltas Keep strong temporal cue
Subject identity Remove (post-pose) causes overfit
Static room multipath Remove breaks transfer
Device-specific phase artifacts Remove breaks cross-hardware
Antenna-layout quirks Normalize deployment portability
Channel instability Expose separately confidence gating / anti-hallucination

2.2 Architecture

CSI frame sequence
  → physics normalization        (antenna geometry, subcarrier stability, phase-unwrap quality, room-impulse structure)
  → masked CSI encoder           (SSL: learn channel structure from unlabeled CSI — 150k home + 320k MM-Fi frames)
  → temporal contrastive encoder (motion continuity)
  → skeleton-aware pose decoder  (graph head — anatomical constraints, GraphPose-Fi style, arXiv 2511.19105)
  → confidence + coherence head  (mincut / spectral coherence as RF-integrity signal)

2.3 Training objectives (loss stack)

L_total = L_pose
        + 0.20 · L_masked_csi          # learn channel structure (unlabeled)
        + 0.10 · L_temporal_contrast   # motion continuity
        + 0.20 · L_pose_contrast        # same-pose-across-subjects = positive  ← the frontier
        + 0.05 · L_subject_decorrelation # remove identity only where it conflicts with pose
        + 0.10 · L_coherence            # predict when RF evidence is weak

Invariant target:

embedding ≈ pose + motion + channel-coherence
embedding ≠ subject-identity + static-room-signature + device-artifact

2.4 The RuView differentiator — auditable RF perception that knows when it's wrong

The coherence head gates pose confidence by channel coherence: when multipath structure changes (mincut / spectral coherence drop), the model flags low RF integrity instead of hallucinating a pose. This is the anti-hallucination component most WiFi-pose papers lack, and it turns RuView from a model into sensing infrastructure. (Ties to ADR-135 coherence gate.)

3. Experiment plan — three variants, frozen-decoder test

Same split, same decoder, same seed set; only the embedding changes.

Variant Description Success threshold (cross-subject torso-PCK)
E1 Masked CSI pretrain +3
E2 Pose-contrastive across subjects +6
E3 Physics-normalized SSL + skeleton head +10

3.1 Expected gains (estimate)

Method cross-subject torso-PCK gain
Naïve embedding 02
DANN adversarial 03 (high collapse risk) — empirically ~0
Masked CSI pretrain +38
Pose-contrastive +512
Physics-norm + SSL + graph decoder +1020
+ more subject-diverse paired data +20

Plausible trajectory: 11.6% → 2025% near term, 3040% with enough subject/environment diversity. That is a stronger research claim than squeezing random-split from 81.6% → 88%.

3.2 Empirical findings (2026-05-31) — measured, not estimated

The near-term algorithmic estimates in §3.1 were tested directly on the official MM-Fi cross-subject split (256,608 train / 64,152 test, same TF pipeline). Measured results:

Method §3.1 estimate Measured Verdict
Baseline (in-harness) 63.13% (doc TTA 64.04) reference
Mixup n/a +0.7 → 63.79% small
Mixup + TTA + 3-seed ensemble n/a +0.964.92% best
Per-antenna instance-norm + SpecAugment n/a 4.6 → 58.52% destroys cross-antenna pose structure
Pose-contrastive foundation pretrain +5 to +12 2.3 → 62.65% refuted
DANN adversarial ~0 ~0 (as predicted)

Why pose-contrastive pretraining fails — the key finding. The supervised-contrastive pretraining loss (positives = same pose-cluster, spanning subjects) never left the uniform-similarity floor ln(B) — across cluster granularities K∈{48,256}, batch sizes {768,1024}, and 3 seeds. The same encoder trivially aligns temporally-adjacent frames (temporal-triplet SSL reached 82%), so the optimizer works; it simply cannot pull same-pose CSI from different subjects together — that invariance is not present in the data to be learned.

Implication for this ADR. The 18-pt in-domain↔cross-subject gap (83.6% → best 64.9%) is fundamental subject-distribution shift in CSI, not an algorithmic gap. No invariance-learning method tested moves it; only variance-reduction (mixup + ensemble) gives <1 pt. This promotes "more subject-diverse paired data" (§3.1 last row, §6 alt 3) from complementary to the primary lever and demotes pure-SSL-on-existing-data as a near-term cross-subject win. The encoder is still worth building for masked-CSI representation reuse and the coherence integrity head, but the cross-subject acceptance gate (§4, ≥6 pts) is unlikely to be met without new multi-subject capture (fleet: cognitum-seed-1 + multi-room, see CLAUDE.local.md). Recommend re-scoping phase 1 around data collection before further loss-stack engineering.

3.3 Subject-scaling study (2026-05-31) — capture diversity, not volume

Before committing to capture, we measured how cross-subject accuracy scales with the number of training subjects (fixed held-out test subjects, official split, mixup+TTA):

N subjects 4 8 12 16 20 24 32
xsubj-PCK@20 36.7 57.7 58.3 61.1 62.7 63.3 63.7

The curve saturates: 4→8 subjects = +21 pts, but 24→32 = +0.45 pts. Asymptote ≈ 6465%, still ~19 pts under in-domain. Key correction to the "more data" recommendation: simply capturing more people from the same distribution will not close the gap — subject-count returns vanish past ~1620 subjects. The residual is device/room/protocol shift (MM-Fi's cross-subject split is partly cross-environment by construction). Re-scoped phase-1 capture target: maximize DIVERSITY (rooms, devices, antenna geometries, traffic protocols), not headcount — and pair it with few-shot target-domain adaptation (a handful of labeled frames from the deployment room), which the saturation curve implies will beat any amount of additional source subjects. This makes the encoder's domain-invariance objective (vs the failed subject-invariance one) the design priority.

3.4 Few-shot target adaptation (2026-05-31) — the actionable resolution

The saturation curve predicts a few labeled frames from the deployment room beat more source subjects. Confirmed. Base trained on all 32 source subjects (63.7% zero-shot on a disjoint 50% held-out of the target subjects), then fine-tuned on K labeled frames per target subject:

K/subject total frames eval PCK@20 Δ
0 0 63.7%
20 160 68.1% +4.3
50 400 72.2% +8.5 (≈ prior SOTA)
200 1,600 76.1% +12.4
1000 8,000 78.3% +14.6

Few-shot calibration dominates source volume. §3.3 showed +24 source subjects (~190K frames) buys +6 pts; here 200 target frames/subject (1,600 frames) buys +12.4 pts. This re-scopes the ADR's acceptance gate and deployment story: the cross-subject gate (§4, ≥6 pts) is trivially met by ~50200 labeled frames of in-room calibration — no foundation encoder or mass capture required for the deployment win. Recommended product behavior: ship a ~30-second on-site calibration (a few hundred labeled frames per room/person) that recovers most of the gap. The foundation encoder's value shifts from "close cross-subject zero-shot" (data says: hard) to "make the few-shot adaptation faster / need fewer calibration frames" — a better-posed, achievable objective. This supersedes the §3.2 pessimism: the frontier is not closed by algorithms or bulk data, but it is cheaply closed at deployment time by few-shot calibration.

Task-general (2026-05-31). The same mechanism was verified on a second MM-Fi task — 27-class action recognition (which the MM-Fi paper never benchmarked for WiFi). Zero-shot cross-subject collapses to ~10% (near-chance), and few-shot calibration recovers it: 50 samples → 36%, 200 → 59%, 1000 → 76%. Action needs more calibration than pose (classification vs regression), but the pattern is identical. Few-shot in-room calibration is the universal deployment answer for WiFi sensing generalization, not a pose-specific result. (Optimization report §36.)

3.5 Deployable adapter calibration (2026-05-31) — the calibration-service mechanism

Full-finetune calibration (§3.4) means a 2.3 MB model copy per room. Compared calibration methods at K=200 frames/subject by accuracy and adapter size:

Method PCK@20 trainable adapter
zero-shot 63.6%
LoRA rank-8 72.5% 11,200 ~11 KB
head+graph only 72.7% 121,828 119 KB
frozen-trunk 73.5% 212,453 207 KB
full finetune 76.2% 2.32 M 2.3 MB

A ~11 KB LoRA adapter recovers +8.9 pts (→72.5%, ≈ prior SOTA) at 0.5 % the model size. This is the concrete mechanism for the RuView calibration service the project wanted: ship the shared base once; each room contributes a 30-second labeled calibration → a ~11 KB per-room LoRA adapter → SOTA-level cross-subject pose, thousands of rooms on one base. Accuracy/size knob: LoRA 11 KB @ 72.5 % → frozen-trunk 207 KB @ 73.5 % → full 2.3 MB @ 76.2 %. Net for this ADR: the encoder/adapter split is validated empirically — a frozen shared trunk + tiny per-room LoRA is the deployable path, and the foundation-encoder objective should be "make this adapter even smaller / need fewer calibration frames."

Calibration data requirement (measured, 3 seeds): the 11 KB LoRA needs ~100200 labeled samples/room to reach ~72% (knee at ~50 → 70%); below ~20 samples it can't fit and may hurt (5 samples → 61% < zero-shot 64%). So the evidence-complete calibration-service spec is: ship shared base → collect ~100200 labeled samples on-site → fit a ~11 KB LoRA~72% cross-subject (SOTA-level). The encoder's research goal is now precisely posed: push that ~100200-sample requirement down and/or lift the >72% ceiling per fixed calibration budget.

3.6 Cross-ENVIRONMENT few-shot (2026-05-31) — no unsolved deployment case

The hard frontier — unseen room and unseen people (cross-environment) — was thought ~unsolvable (zero-shot ~1017%). Few-shot calibration rescues it even more dramatically than cross-subject:

K labeled samples/subject cross-env PCK@20 Δ zero-shot
0 10.6%
5 60.1% +49.5
20 66.0% +55.5
50 70.0% +59.4
200 73.1% +62.5
1000 75.4% +64.8

Just 5 calibration samples per person lift an unseen room from ~unusable (10.6%) to 60%. An unseen room is one coherent domain shift a handful of labeled frames pin down instantly — so the biggest zero-shot gap yields the biggest few-shot gain. Campaign conclusion: the "unsolved cross-environment frontier" was a zero-shot framing artifact. With the ~11 KB LoRA calibration mechanism (§3.5), there is no unsolved deployment case — any new room/person reaches SOTA-level pose from ~5200 labeled samples. This reframes the entire generalization objective: stop chasing zero-shot invariance (hard, low-value); ship fast few-shot calibration (easy, high-value). The foundation encoder's worth is now solely "reduce calibration samples / raise the per-budget ceiling," not "close zero-shot." Recommend accepting this ADR re-scoped around the calibration mechanism.

4. Acceptance Test

The encoder is accepted only if it improves cross-subject torso-PCK@20 by ≥ 6 absolute points without reducing random-split torso-PCK@20 by more than 2 points — on the same MM-Fi pipeline, one-command reproduction, with per-joint error tables. Results land as AetherArena witness rows (ADR-149), nothing published until reviewed.

5. Consequences

Positive: a reusable, self-supervised RF foundation encoder for CSI/CIR/BFLD; the first principled attack on the cross-subject frontier; the coherence head adds an anti-hallucination integrity signal no competitor has.

Negative / risk: SSL pretraining requires matching the production CSI→feature pipeline (ADR-149 §SSL note flagged the resampling-replication risk); the multi-loss stack needs careful weight tuning (DANN showed loss-imbalance can collapse training); physics normalization must be validated not to discard pose-relevant deltas.

Neutral: the in-domain head is unchanged; the encoder slots in front of the existing pose decoder.

6. Alternatives Considered

  1. Bigger model only — tested; hurts cross-subject (overfits seen subjects).
  2. Naïve DANN subject-adversarial — tested; no gain, collapse risk; entangles pose evidence.
  3. More data only (camera/ADR-079) — complementary and ultimately necessary, but slow and out-of-band; the encoder extracts more from existing data first.

7. Open Questions

  1. Physics-normalization spec — exact antenna/subcarrier/phase terms, validated to preserve pose deltas.
  2. Masked-CSI SSL on the production feature pipeline (resampling match — see ADR-149).
  3. Where the coherence/mincut integrity signal is computed (reuse ADR-135 coherence gate vs new head).
  4. CIR (ADR-134) / BFLD fusion into the same encoder — phase 3.