Files
wifi-ruview/docs/adr/ADR-102-edge-module-registry.md
rUv 67fec45e61 feat(edge-registry): ADR-102 — surface Cognitum cog catalog via /api/v1/edge/registry (#648)
* feat(edge-registry): ADR-102 — surface Cognitum cog catalog via /api/v1/edge/registry

Adds a new sensing-server endpoint that fetches and caches the canonical
Cognitum app registry at
https://storage.googleapis.com/cognitum-apps/app-registry.json (105 cogs
across 11 categories as of v2.1.0). RuView previously had no live
awareness of the catalog — the README's capability table was hand-
curated and went stale as Cognitum shipped new cogs (the registry was
last updated 6 days ago).

ADR:
* docs/adr/ADR-102-edge-module-registry.md — full design, response
  shape, configuration flags, failure modes, and a 12-row security
  review covering SSRF, response inflation, ?refresh abuse, stale-serve
  semantics, TLS, cache poisoning, JSON-panic resistance, etc.

Code:
* v2/.../edge_registry.rs — EdgeRegistry struct + UreqFetcher +
  MockFetcher trait + 7 unit tests. RwLock<Option<CachedEntry>> with
  stale-on-error fallback. MAX_PAYLOAD_BYTES=8 MiB, 10s wire timeout.
* v2/.../main.rs — constructs Option<Arc<EdgeRegistry>> at startup,
  registers GET /api/v1/edge/registry handler, wires Extension layer.
  Handler runs the blocking ureq fetch via tokio::task::spawn_blocking
  so the async runtime stays free.
* v2/.../cli.rs / main.rs Args — three new flags (per user request to
  "allow the registry to be disabled or changed"):
    --edge-registry-url <URL>       (env RUVIEW_EDGE_REGISTRY_URL)
    --edge-registry-ttl-secs <N>    (env RUVIEW_EDGE_REGISTRY_TTL_SECS)
    --no-edge-registry              (env RUVIEW_NO_EDGE_REGISTRY)
  When --no-edge-registry is set or the URL is empty, the endpoint
  returns 404.

Cargo.toml: adds ureq (rustls), sha2, thiserror as direct deps.

README:
* New collapsed "🧩 Edge Module Catalog" section with the full 105-cog
  table generated from the registry, grouped by category with practical
  one-line descriptions (e.g. "Spots irregular heartbeats and abnormal
  heart rhythms", "Detects walking problems and scores fall risk").
  Links to https://seed.cognitum.one/store and the local appliance
  /cogs page. Sits between the HF model section and How It Works.

Tests (7/7 pass):
  first_call_hits_upstream_and_caches
  ttl_expiry_triggers_refetch
  force_refresh_bypasses_fresh_cache
  stale_serve_on_upstream_failure_after_cached_success
  no_cache_no_upstream_returns_error
  upstream_invalid_json_is_treated_as_error
  upstream_sha256_is_deterministic

Security highlights (full review in ADR-102 §"Security review"):
- The registry is metadata-only; per-cog binary signatures (ADR-100)
  remain the trust root for installs. A compromised registry can
  mislead a human reader but cannot ship malicious binaries.
- 8 MiB cap + 10s timeout + Option<Arc<...>> via Extension layer means
  the endpoint can't be used to exhaust memory or pin tokio threads.
- Stale-on-error responses carry an explicit `stale: true` field so
  upstream outages are visible to consumers rather than silently
  masked.
- Endpoint sits behind the existing RUVIEW_API_TOKEN bearer gate when
  set, otherwise unauthenticated (registry contents are public anyway).

* chore: refresh Cargo.lock for ureq/sha2/thiserror deps added by ADR-102
2026-05-19 18:08:43 -04:00

12 KiB
Raw Permalink Blame History

ADR-102: Edge Module Registry Integration

  • Status: Accepted
  • Date: 2026-05-19
  • Deciders: ruv

Context

The Cognitum app ecosystem publishes a canonical app store catalog at:

https://storage.googleapis.com/cognitum-apps/app-registry.json

As of v2.1.0 (2026-05-13) the registry advertises 105 cogs across 11 categories (health, security, building, retail, industrial, research, ai, swarm, signal, network, developer). Each entry carries id, name, category, version, description, size_kb, difficulty, sha256, binary_size, and a config[] schema describing the runtime parameters the appliance offers when installing the cog.

RuView today has no live awareness of this catalog. The README.md capability table is hand-curated; the UI surfaces only the capabilities the dashboard's HTML knows about; nothing in wifi-densepose-sensing-server references the registry. Result: when Cognitum ships a new cog (the registry was last updated 6 days ago — a fast cadence), RuView stays unaware until someone manually edits the README. Customers running the RuView dashboard against a real appliance see a 10-capability bag in the UI while the appliance is actually capable of installing 105 cogs.

Today's cog-pose-estimation@0.0.1 release (PRs #642 / #643, ADR-100, ADR-101) is the first cog this repo ships to that registry. We need the discovery side to match.

Decision

wifi-densepose-sensing-server will fetch app-registry.json on demand, cache it in process memory with a TTL, and serve it back through a new endpoint:

GET /api/v1/edge/registry
GET /api/v1/edge/registry?refresh=1   (force-bypass cache, log if abused)

The registry is passively surfaced, not modified. RuView is a presentation layer for the canonical Cognitum catalog; it never re-signs entries or re-hosts binaries.

Module

v2/crates/wifi-densepose-sensing-server/src/edge_registry.rs — small, ~150 lines.

pub struct EdgeRegistry {
    cached: RwLock<Option<CachedEntry>>,
    ttl: Duration,
    upstream_url: String,
}

struct CachedEntry {
    payload: serde_json::Value,
    fetched_at: Instant,
    upstream_sha256: String,
}

Cache semantics:

  • TTL 3600 s (1 hour) by default — registry updates land on a roughly-weekly cadence and a stale-by-an-hour catalog is fine.
  • ?refresh=1 bypasses the cache but writes a debug log so accidental abuse is visible.
  • On upstream fetch failure when the cache is non-empty, serve the stale cached copy with a stale: true marker in the response and a 200 status (preserve UI), not a 5xx.
  • On upstream fetch failure when the cache is empty, return 503 with the upstream error in the body.

Response shape

{
  "fetched_at": 1779200000,           // server-side fetch timestamp
  "ttl_seconds": 3600,
  "stale": false,                     // true when serving past TTL because upstream is down
  "upstream_url": "https://storage.googleapis.com/cognitum-apps/app-registry.json",
  "upstream_sha256": "<sha256-of-payload-bytes>",
  "registry": { /* full canonical JSON as returned upstream */ }
}

The registry field is the upstream JSON inlined verbatim so consumers don't need to make a second hop. upstream_sha256 lets a paranoid consumer compare against a pinned hash.

Trust / verification

  • Bucket is public-read with object versioning enabled (per ADR-100 §"GCS misconfiguration risks").
  • The cog-level binary_sha256 + binary_signature (ADR-100) are the trust roots for installs. The registry itself is not signed today.
  • We deliberately do not add a signature requirement to the registry JSON in this ADR — that would block the integration on a parallel infrastructure project. A future ADR can layer signature checks on top once the publisher pipeline emits them.

UI surfacing

New page ui/edge-modules.html renders the registry into category sections with cog cards. Each card links out to the Cognitum V0 appliance's /cogs page (http://cognitum-v0:9000/cogs#<id>) for the install action — RuView itself never installs.

The existing dashboard's "Capabilities" section continues to show RuView-native sensing capabilities (presence, breathing, pose, etc. — the things RuView itself runs); the new edge-modules page shows the broader Cognitum cog catalog. The two are distinct surfaces and shouldn't be merged.

Failure modes

Scenario Behaviour
Upstream returns 200 with valid JSON Cache it, return it.
Upstream returns 200 with invalid JSON Treat as failure; serve stale if available else 503. Log the upstream sha + the parse error.
Upstream returns 4xx / 5xx Same as JSON-invalid: serve stale if available else 503.
TLS / DNS / timeout error Same.
Upstream is permanently moved Operator updates the upstream_url config (CLI flag added). No code change required to migrate registries.

Configuration

  • --edge-registry-url <URL> — override the default (default: https://storage.googleapis.com/cognitum-apps/app-registry.json)
  • --edge-registry-ttl-secs <N> — override the cache TTL (default: 3600)
  • --no-edge-registry — disable the endpoint entirely (returns 404). For air-gapped deployments.

Consequences

Positive

  • One source of truth for the cog catalog across RuView + Cognitum dashboards.
  • Zero ongoing maintenance: when Cognitum publishes registry v2.2.0, RuView sees it within an hour without a release.
  • The endpoint is also useful for non-UI consumers (CI checks, fleet automation, third-party integrations).
  • Lets us deprecate the hand-curated README capability table in favour of generated content (separate PR).

Negative

  • Adds an outbound HTTP dependency to the sensing-server. Air-gapped deployments must use --no-edge-registry.
  • Stale-but-served behaviour can mask upstream outages from operators. Mitigation: include stale: true + fetched_at in the response so the UI can render a "registry possibly out of date" badge.

Risks

  • Upstream rug-pull: if cognitum-apps is deleted or replaced, the endpoint goes dark. The --edge-registry-url flag lets operators repoint without a code change. Long-term, RuView could mirror the registry into its own GCS bucket if the relationship requires it.
  • Cache poisoning: the upstream is public-read; an attacker who breaches Cognitum's GCS write could push a bad registry. The cog-level signatures (ADR-100) limit the blast radius — bad registry entries can't install bad binaries, only show wrong metadata. Acceptable until registry-level signing lands.

Security review

A real review of the attack surface this endpoint introduces.

Threats considered

# Threat Mitigation in this ADR
T1 SSRF — operator-supplied --edge-registry-url redirects fetches to an internal target Flag is operator-only (CLI / env) — there is no API endpoint to mutate it at runtime. Operators are already trusted (they control the binary).
T2 Outbound dependency reveals deployment — a passive observer of the egress sees the appliance phoning home to GCS Documented in the docstring + the runtime startup log. Operators wanting offline deployments use --no-edge-registry.
T3 Malicious upstream registry — Cognitum's GCS bucket is breached and a poisoned app-registry.json is served Two layers absorb this: (a) the registry's role is discovery only — installs verify the per-cog binary_sha256 + binary_signature (ADR-100); a wrong description string can mislead a human, but a wrong binary still has to pass Ed25519 against COGNITUM_OWNER_SIGNING_KEY. (b) The endpoint exposes upstream_sha256 so a paranoid operator can pin the expected registry hash externally and alert on drift.
T4 Response inflation — upstream returns a multi-GB payload to exhaust memory MAX_PAYLOAD_BYTES = 8 MiB cap (current registry is ~50200 KB). Exceeding cap returns an error without buffering past the cap.
T5 Slow upstream blocking server threads — Slowloris-style stall on the fetch 10-second wire timeout via ureq::AgentBuilder. Per-handler fetch runs inside tokio::task::spawn_blocking so a stalled fetch never blocks the async runtime.
T6 Denial via ?refresh=1 abuse — unauthenticated callers force-bypass the cache repeatedly Cache lives in process; ?refresh=1 triggers a single upstream fetch behind a synchronous code path. A flood of refresh requests is rate-limited by the upstream's own throttling (GCS) and locally serialised by Rust's RwLock. Refresh requests are logged at debug so abuse is visible. Follow-up: add per-IP rate-limit middleware if seen abused (separate PR; tracked in #574-style follow-up).
T7 JSON deserialisation panics — malformed registry triggers a Rust panic Payload is parsed as serde_json::Value (opaque untyped tree) — never coerced into a strongly-typed struct that could panic. Failure is propagated as FetcherError::Network which the handler maps to 503.
T8 Stale-on-error masks outages from operators Response carries stale: true + fetched_at (unix timestamp). UI rendering MUST surface this badge — encoded as an explicit field, not an implicit silence.
T9 TLS downgrade / MITM on the fetch ureq is built with the tls feature (rustls) by default. No --insecure flag exists. If the upstream uses LetsEncrypt the cert chain is system-trusted; certificate pinning is out of scope (would block the bucket from rotating certs).
T10 Unauthenticated access exposes what cogs exist The registry is canonical-public information (already public-read on GCS via anonymous HTTP GET). Surfacing it on a local LAN HTTP API does not increase its disclosure. The endpoint stays under the project's existing RUVIEW_API_TOKEN Bearer auth — when set, the registry is gated like other /api/v1/* routes.
T11 Configuration injection via env varRUVIEW_EDGE_REGISTRY_URL set to a malicious URL by an attacker who controls the process environment If an attacker controls the env, they own the process; this is not a new threat surface. Documented in the CLI help.
T12 Cache mutation across threads / poisoning The cache is RwLock<Option<CachedEntry>>. Writes go through cached.write() once per fetch. Snapshot reads clone() the CachedEntry (cheap — Value is reference-counted internally for large strings) so concurrent readers don't share mutable state. Tests cover the multi-call path; no unsafe is used.

What this ADR does NOT secure

  • Registry-level signing — the JSON payload itself is unsigned. If/when Cognitum's publisher pipeline emits a registry sig (e.g. detached .json.sig), a follow-up ADR will require it. Today the per-cog binary signature (ADR-100) is the actual trust root for installs; the registry is metadata.
  • Per-client rate-limiting on ?refresh=1 — relies on the upstream's own throttling. If we see abuse we'll add a token-bucket middleware; not needed for v0.0.1.

Testing

Test What it verifies
first_call_hits_upstream_and_caches Single fetch, then cache hit
ttl_expiry_triggers_refetch Cache TTL bound respected
force_refresh_bypasses_fresh_cache ?refresh=1 semantics
stale_serve_on_upstream_failure_after_cached_success T8 explicit (stale: true returned)
no_cache_no_upstream_returns_error T3/T5 — error propagated cleanly when nothing to fall back on
upstream_invalid_json_is_treated_as_error T7 — malformed payload doesn't panic
upstream_sha256_is_deterministic T3 — hash field is reliable for external pinning

All 7 tests in src/edge_registry.rs::tests pass.

Migration

  1. Land this ADR + the implementing PR.
  2. UI: ship ui/edge-modules.html and link from index.html.
  3. After two clean releases of the endpoint, remove the hand-curated "Capabilities" table from README.md and replace with a small "see the appliance for the full catalog" pointer.
  4. Future ADR: registry signing once Cognitum's publisher pipeline emits a sig.

See also

  • ADR-100: Cognitum Cog Packaging Specification (binary trust model).
  • ADR-101: Pose Estimation Cog (the first repo-shipped cog visible in the registry).
  • v0-appliance ADR-220: Cog management surface (where this registry is the input to install actions).
  • docs/benchmarks/pose-estimation-cog.md: the per-cog benchmark format this ADR's response shape complements.