| 1 |
# Sando observability architecture |
| 2 |
|
| 3 |
Status: draft, 2026-06-03. Argument shape, not a checklist. |
| 4 |
|
| 5 |
Goal: an error and visibility surface built on newtypes. Sando's current daemon and TUI agree by string convention — `tier: String`, `version: String`, `gate: String`, `detail: Option<String>` — and every consumer (schema, WS payloads, TUI render) reparses or pattern-matches those strings independently. The result is that gate failure classification only exists as prose in `gate_runs.detail`, and the TUI's only recourse on red is "open `/logs/<version>/<gate>` and read the tail." This document proposes the type graph that replaces the strings, the boundary parses that mint the types, and the persistence + wire shapes that follow. |
| 6 |
|
| 7 |
This is one cohesive design. The launchplan splits it into "Phase B (typed GateOutcome)" and "Phase C (live tail + remaining newtypes)"; that split is a delivery convenience, not an architectural seam. The migration order below preserves the option to ship Phase B first, but the types it introduces must already be the final shape, not a string stand-in to be newtyped later. |
| 8 |
|
| 9 |
## What is currently stringly-typed |
| 10 |
|
| 11 |
Inventory from a read of `daemon/src/gates.rs`, `daemon/src/events.rs`, `daemon/src/topology.rs`, `daemon/migrations/001_init.sql`, and `tui/src/main.rs:186-266`: |
| 12 |
|
| 13 |
|
| 14 |
|
| 15 |
| Tier | `tier: String` | `GateCtx`, every `Event` variant, `tiers.name`, `gate_runs.tier`, `tier_state.tier`, `deploys.tier`, TUI `str_v(v.get("tier"))` | |
| 16 |
| Node | `node: String` (sometimes `Option`) | `Tier.nodes[].name`, `deploys.node`, `Event::Deploy*` | |
| 17 |
| Version | `version: String` (semver `0.8.12`) | `GateCtx`, `versions.version`, every event, TUI display | |
| 18 |
| Git sha | `sha: String` (short or full?) | `versions.git_sha`, `Event::RebuildRequested/BuildStart/BuildOk/BuildFailed/BuildAborted`, `/logs/<sha>/<gate>` per launchplan §3 | |
| 19 |
| Gate kind | `Gate` enum *inside the daemon*; `gate_kind: &'static str` once it crosses the events/schema boundary; reparsed in TUI | `gates::kind_str`, `gate_runs.gate_kind`, `Event::Gate*.gate`, TUI `match kind` | |
| 20 |
| Gate outcome detail | `detail: Option<String>` | `GateOutcome.detail`, `gate_runs.detail`, conflates: config error, missing prerequisite, real failure with stderr tail, burn-in clock progress, "waiting on operator" | |
| 21 |
| Deploy outcome | `outcome TEXT DEFAULT 'in_progress'` | `deploys.outcome`; no enum on the Rust side | |
| 22 |
| Deploy failure | `error: String` | `Event::DeployFailed.error`; freeform | |
| 23 |
| Backup source | `source: String` | `backups.source`, `Event::BackupFetched.source`; an ssh URL today | |
| 24 |
| Canary policy | typed (`CanaryPolicy`) at config load, then `as_str()` to a `TEXT` column | `tiers.canary` | |
| 25 |
|
| 26 |
`gates::kind_str` and the TUI's `match kind` are the clearest tell — we have an enum on each side of the wire and a string in the middle. That string is the only schema, so adding a gate kind is a four-place edit with no compile-time enforcement. |
| 27 |
|
| 28 |
The same anti-pattern with higher stakes is `detail`. Three concrete examples from `gates.rs`: |
| 29 |
|
| 30 |
- `migration_dry_run` writes `"scratch_db_url unset in daemon config"` (a config bug, no point retrying). |
| 31 |
- `migration_dry_run` writes `"no backup fetched; call /backup/fetch first"` (a missing prerequisite, retry after `/backup/fetch`). |
| 32 |
- `boot_smoke` writes `"binary exited early: exit status: 101\n==== stdout ====\n…"` (a real failure carrying a tail). |
| 33 |
- `burn_in` writes `"47 hours remaining of 168"` (not a failure at all — the gate is correctly red but the deploy is not blocked by a defect). |
| 34 |
|
| 35 |
Today, an operator (or the TUI, or a future alerting rule) cannot tell those apart without substring-matching the prose. The point of the redesign is to make that classification a property of the value, not a property of how it's spelled. |
| 36 |
|
| 37 |
## The type graph |
| 38 |
|
| 39 |
Domain types live in a new `daemon/src/domain.rs` (name negotiable). They are the vocabulary every other module speaks; they implement `Serialize`/`Deserialize`, `Display`, `FromStr`, and `sqlx::Type` so they round-trip through events, JSON responses, and SQLite columns without per-site conversion. |
| 40 |
|
| 41 |
```text |
| 42 |
┌──────────────┐ |
| 43 |
│ TierId │ newtype String, validated against topology on construction |
| 44 |
└──────────────┘ |
| 45 |
┌──────────────┐ |
| 46 |
│ NodeId │ newtype String, validated against TierId's nodes |
| 47 |
└──────────────┘ |
| 48 |
┌──────────────┐ |
| 49 |
│ Version │ parsed semver (semver::Version), Display = "0.8.12" |
| 50 |
└──────────────┘ |
| 51 |
┌──────────────┐ |
| 52 |
│ GitSha │ enforced 40-hex or short-7-hex; one canonical form |
| 53 |
└──────────────┘ |
| 54 |
┌──────────────┐ |
| 55 |
│ GateKind │ enum, replaces both `Gate` discriminant and `gate_kind` string |
| 56 |
│ │ Variants: CargoTest, MigrationDryRun, BootSmoke, |
| 57 |
│ │ BurnIn, ManualConfirm. Display = snake_case. |
| 58 |
└──────────────┘ |
| 59 |
┌──────────────┐ |
| 60 |
│ GateRunId │ newtype i64, identifies one row of gate_runs |
| 61 |
└──────────────┘ |
| 62 |
┌──────────────┐ |
| 63 |
│ DeployId │ newtype i64 |
| 64 |
└──────────────┘ |
| 65 |
``` |
| 66 |
|
| 67 |
`Gate` (the topology struct with `BurnIn { hours }`) and `GateKind` (the discriminant) become distinct: `Gate` is the *config* (kind + parameters), `GateKind` is the *identifier* you use to talk about a class of gate in events and the schema. `gate_runs.gate_kind` stores `GateKind`; gate parameters at the time of the run, if we ever need them in history, become a separate column. |
| 68 |
|
| 69 |
### `GateOutcome` redesign |
| 70 |
|
| 71 |
```rust |
| 72 |
pub struct GateOutcome { |
| 73 |
pub status: GateStatus, |
| 74 |
pub log_ref: Option<LogRef>, // pointer to persisted stdout/stderr, not the tail itself |
| 75 |
} |
| 76 |
|
| 77 |
pub enum GateStatus { |
| 78 |
Passed { note: PassNote }, |
| 79 |
Failed(GateFailure), |
| 80 |
Blocked(GateBlocker), // gate cannot run yet; not a defect |
| 81 |
} |
| 82 |
|
| 83 |
pub enum PassNote { |
| 84 |
StayedUp { duration_s: u32 }, // boot_smoke |
| 85 |
BurnInElapsed { hours: u32 }, // burn_in |
| 86 |
Migrated { backup: BackupId, count: u32 }, // migration_dry_run |
| 87 |
TestsPassed { count: u32, duration_s: u32 }, // cargo_test |
| 88 |
OperatorConfirmed { at: DateTime<Utc> }, // manual_confirm |
| 89 |
} |
| 90 |
|
| 91 |
pub enum GateBlocker { |
| 92 |
BurnInClockNotStarted, |
| 93 |
BurnInRemaining { hours: u32, total: u32 }, |
| 94 |
AwaitingOperatorConfirmation, |
| 95 |
NoBackupAvailable, |
| 96 |
ScratchDbUrlUnset, |
| 97 |
ArtifactMissing { version: Version }, |
| 98 |
} |
| 99 |
|
| 100 |
pub enum GateFailure { |
| 101 |
CargoTest { failed_count: u32, first_failed: Option<String> }, |
| 102 |
MigrationDrift { migration: String }, // "previously applied but missing" |
| 103 |
MigrationModified { migration: String }, // "previously applied but modified" |
| 104 |
MigrationSqlError { migration: String, sqlstate: Option<String> }, |
| 105 |
RestoreFailed { kind: RestoreFailureKind }, |
| 106 |
BootPanic { exit_code: Option<i32> }, |
| 107 |
BootExitedEarly { exit_code: Option<i32> }, |
| 108 |
SpawnFailed { os_error: i32 }, |
| 109 |
Timeout { gate: GateKind, after_s: u32 }, |
| 110 |
Unclassified, // fallthrough; log_ref required |
| 111 |
} |
| 112 |
``` |
| 113 |
|
| 114 |
Three things to notice about this shape: |
| 115 |
|
| 116 |
1. **`Blocked` is its own variant.** Burn-in not yet elapsed and "scratch_db_url unset" are not failures — they are pre-conditions the operator can address out-of-band. The TUI can render them yellow instead of red. Today they're red, indistinguishably. |
| 117 |
2. **`log_ref`, not `detail`.** The structured variants carry just enough to render a one-line summary (`migration 0047 modified`, `tests failed: 3`). The full tail lives on disk and the variant carries a pointer to it. This is the architectural seam between Phase B (classification) and Phase C (live tail) — once `log_ref` exists, Phase C is "ship chunks of that log over WS as they're written" rather than a redesign. |
| 118 |
3. **`Unclassified` is admitted.** Classifiers are best-effort and the migration plan must work when a new failure mode shows up that no classifier matches yet. Unclassified failures still have a `log_ref`; they degrade gracefully to the current "read the tail" experience without breaking the contract. |
| 119 |
|
| 120 |
### Classifier layer |
| 121 |
|
| 122 |
Each gate's runner produces raw output (`stdout`, `stderr`, exit status, plus any structured signals like a sqlx error). A classifier maps `(GateKind, RawOutput) -> GateStatus`. Classifiers are pure functions and live in `daemon/src/classify/` — one file per gate. The taxonomy in `plans/migration-dryrun-failures.md` is already 80% of `migration_dry_run`'s classifier; it gets ported into code, not just docs. |
| 123 |
|
| 124 |
Classifiers can be unit-tested with captured fixtures (`tests/fixtures/cargo_test_failed_compile.txt` → `GateStatus::Failed(GateFailure::CargoTest { failed_count: 0, .. })`). This is the first place Sando gets meaningful test coverage for diagnostic behavior, which `todo.md` flags as a gap. |
| 125 |
|
| 126 |
### Where each type enters Sando |
| 127 |
|
| 128 |
Boundary discipline: every newtype is constructed once, at the edge where its string form enters the process. Internally, only the typed form moves. |
| 129 |
|
| 130 |
- `TierId`, `NodeId` — at `Topology::load`. The validator already walks tiers; it now mints `TierId`s and rejects unknown references at parse time instead of letting them surface as foreign-key failures. |
| 131 |
- `Version` — at the build step (`build.rs`), parsed from the server `Cargo.toml`. Stored as text in `versions.version`; the column round-trips through `sqlx::Type for Version`. |
| 132 |
- `GitSha` — at the post-receive hook entry point in `routes.rs::rebuild`. The hook today passes a string; the route normalizes to `GitSha` immediately. The `/logs/<sha>/<gate>` route (launchplan §3 wording) becomes `/logs/<GitSha>/<GateKind>`, but on-disk storage continues to key by `Version` until the build step has both available together. The comment at `gates.rs:432-434` already anticipates this transition. |
| 133 |
- `GateKind` — at `Topology` deserialization. `Gate` keeps its parameter-carrying variants; `GateKind` is derived from `Gate` and used everywhere else. |
| 134 |
- `GateOutcome` — produced by gate runners, persisted, emitted in events, displayed in TUI. Never lowered to a `String` along the way. |
| 135 |
|
| 136 |
## Persistence |
| 137 |
|
| 138 |
`gate_runs` becomes: |
| 139 |
|
| 140 |
```sql |
| 141 |
CREATE TABLE gate_runs ( |
| 142 |
id INTEGER PRIMARY KEY AUTOINCREMENT, |
| 143 |
version TEXT NOT NULL REFERENCES versions(version), |
| 144 |
tier TEXT NOT NULL REFERENCES tiers(name), |
| 145 |
gate_kind TEXT NOT NULL, -- GateKind, Display form |
| 146 |
started_at TEXT NOT NULL, |
| 147 |
finished_at TEXT, |
| 148 |
status TEXT, -- 'passed' | 'failed' | 'blocked' | NULL while in-flight |
| 149 |
outcome_json TEXT, -- serialized GateOutcome (PassNote / GateFailure / GateBlocker) |
| 150 |
log_ref TEXT -- relative path under cfg.logs_root, or NULL |
| 151 |
); |
| 152 |
``` |
| 153 |
|
| 154 |
`status` is denormalized for cheap indexing and `WHERE status = 'failed'` queries. `outcome_json` is the source of truth and is what the daemon reads back when serving `/state`. The migration drops `passed INTEGER` and `detail TEXT`, with a backfill that maps: |
| 155 |
|
| 156 |
- `passed = 1` → `status = 'passed'`, `outcome_json = {"kind":"passed","note":{"kind":"legacy","text":<old detail>}}` |
| 157 |
- `passed = 0`, detail matches a known prefix (`"burn-in"`, `"scratch_db_url unset"`, `"no backup fetched"`, `"waiting on operator"`) → `status = 'blocked'`, appropriate `GateBlocker` variant |
| 158 |
- `passed = 0` otherwise → `status = 'failed'`, `outcome_json = {"kind":"failed","failure":{"kind":"unclassified","legacy_detail":<old>}}` |
| 159 |
|
| 160 |
Backfill correctness is testable against the existing prod sandod sqlite — there are not many rows. |
| 161 |
|
| 162 |
`deploys.outcome` gets the same treatment in a smaller, separate migration: enum-typed (`InProgress | Ok | Failed { kind: DeployFailureKind }`), with the failure kind classifying spawn/transport/health-check distinctly from the freeform `error: String` carried by `Event::DeployFailed` today. |
| 163 |
|
| 164 |
## Wire shape (events + `/state`) |
| 165 |
|
| 166 |
`Event::GateDone` becomes: |
| 167 |
|
| 168 |
```rust |
| 169 |
GateDone { |
| 170 |
tier: TierId, |
| 171 |
version: Version, |
| 172 |
gate: GateKind, |
| 173 |
outcome: GateOutcome, |
| 174 |
} |
| 175 |
``` |
| 176 |
|
| 177 |
The old `passed: bool` is gone — `outcome.status` carries strictly more information. The TUI's `format_event` becomes a match on the typed envelope (deserialized via `serde_json::from_str::<EventEnvelope>`, not `Value` reflection), and the per-kind render functions in `tui/src/main.rs:186-266` collapse into `Display` impls on the domain types. The `str_v` / `num_v` helpers stop being needed. |
| 178 |
|
| 179 |
`/state` (the TUI's polling endpoint) gains a `gates: Vec<GateRunSummary>` per tier, where `GateRunSummary` is the typed outcome plus timestamps and a `log_ref`. The TUI no longer needs `/logs/...` to populate its primary view; `/logs` becomes the drill-down for unclassified failures only. |
| 180 |
|
| 181 |
Phase C's live tail rides on the same `log_ref`: while a gate is in flight, `log_ref` is present and the WS emits `GateLogChunk { run_id: GateRunId, bytes: Vec<u8>, seq: u32 }` events as the runner flushes to disk. The TUI keeps a per-run ring buffer keyed by `GateRunId`. Because `GateRunId` is a newtype with a clear identity, the wire protocol does not need to invent a separate stream identifier. |
| 182 |
|
| 183 |
## Migration order |
| 184 |
|
| 185 |
The order is constrained by what compiles together, not by the launchplan's B-then-C framing. |
| 186 |
|
| 187 |
1. **Domain types module** — introduce `TierId`, `NodeId`, `Version`, `GitSha`, `GateKind`, `GateRunId`, `DeployId` with `Serialize` / `sqlx::Type` impls. No call sites change yet. Pure addition. |
| 188 |
2. **Topology + config use the types** — `Topology::load` mints `TierId` / `NodeId`; `GateCtx` carries `TierId`, `Version`, not strings. `gate_runs.tier` and `gate_runs.version` columns stay `TEXT` (no schema migration); the change is on the Rust side only. `kind_str` deletes. |
| 189 |
3. **Events use the types** — every `Event` variant takes the newtypes. WS frames are unchanged on the wire (snake-case strings) because `Serialize` impls produce identical JSON. The TUI keeps its `Value` reflection during this step; nothing forces it to break. |
| 190 |
4. **`GateOutcome` redesign + classifier layer + `gate_runs` migration** — the biggest single step. Runners produce typed outcomes; classifiers live in `daemon/src/classify/`; schema migration 003 lands with backfill. `/state` exposes typed outcomes. TUI keeps `Value` reflection but is now reading typed fields (`status`, `outcome_json`). |
| 191 |
5. **TUI typed event handling** — deserialize `EventEnvelope` directly, drop `str_v` / `num_v`, render via `Display` on domain types. Yellow for `Blocked`, red for `Failed`, green for `Passed`. |
| 192 |
6. **Live tail (`GateLogChunk` events, `log_ref` plumbing)** — gates write to a per-run log file as they run, not at completion; runners drop the `Vec<u8>` buffers in `boot_smoke` / `cargo_test` in favor of streaming through a `tokio::fs::File` writer that also broadcasts chunks. TUI consumes. |
| 193 |
7. **Deploy outcome typing + `DeployFailed` failure-kind classifier** — same pattern, smaller scope. Closes out the rest of the string surface. |
| 194 |
|
| 195 |
Steps 1–3 can ship in one commit per step without changing observable behavior. Step 4 is where the operator visibility actually changes; it's the smallest unit that justifies a version bump and a sando deploy to itself. |
| 196 |
|
| 197 |
## Non-goals and open questions |
| 198 |
|
| 199 |
- **Not aiming for a generic gate framework.** The `GateFailure` variants are MNW-specific (cargo, sqlx migrations, `makenotwork` binary boot). If we ever gate non-MNW projects, this enum will need to factor; until then, concrete variants are clearer than `Box<dyn>` traits. |
| 200 |
- **Not introducing `Result<GateOutcome, GateError>`.** Runner-internal failures (sqlite I/O, can't spawn) collapse into `GateStatus::Failed(GateFailure::SpawnFailed | …)` or `GateBlocker` as appropriate. The outer `Result` only exists for genuine "the daemon itself is broken" cases that prevent persistence at all. |
| 201 |
- **Open: backfill aggressiveness.** Should we run classifiers retroactively against the historical `detail` strings during the schema-3 backfill, or only forward? Forward-only is simpler and the history isn't operator-load-bearing — leaning toward forward-only with the `unclassified + legacy_detail` envelope above, but worth deciding. |
| 202 |
- **Open: where does `GateRunId` originate?** Today the `INSERT … RETURNING id` returns a raw i64 inside `gates::run`. Cleanest is for `gates::run` to mint a `GateRunId` and pass it back; the event bus then carries it from `GateStart` through `GateLogChunk` through `GateDone`, making client-side correlation trivial. Worth confirming this doesn't break the `manual_confirm` lookup pattern in `gates.rs:386-403`. |
| 203 |
- **Open: `serde(tag = "kind")` vs explicit variant shape on `GateOutcome`.** The events module already uses `#[serde(tag = "kind", rename_all = "snake_case")]`, which would give us `{"kind":"failed","failure":{"kind":"migration_drift","migration":"0047"}}`. Two layers of tagged unions is verbose but consistent. Alternative: flatten via `#[serde(flatten)]` and accept that the JSON shape diverges slightly from the Rust shape. Leaning toward the explicit two-layer form because the TUI's parser then mirrors the Rust enum 1:1. |
| 204 |
- **Open: do we keep `detail TEXT` for one release as a shadow column, or drop in the same migration?** Shadow is safer for rollback. Adds clutter. Probably worth one release of shadow. |
| 205 |
|
| 206 |
## Acceptance, when this is done |
| 207 |
|
| 208 |
- `gates.rs` contains no `String` literals describing failure modes. Every failure path constructs a `GateFailure` variant. |
| 209 |
- The TUI's `format_event` is gone; rendering goes through `Display` on domain types. |
| 210 |
- An operator hitting `c /api/state` sees structured outcomes (not just `"passed": false, "detail": "..."`). |
| 211 |
- A new gate kind is added by extending `GateKind`, `Gate`, and a classifier; the compiler enforces every other site. |
| 212 |
- `/logs/<version>/<gate>` is a fallback for `Unclassified` outcomes, not the primary diagnostic path. |
| 213 |
- Live tail works for `boot_smoke` and `cargo_test`; the buffered `Vec<u8>` pattern in `gates.rs:317-332` is gone. |
| 214 |
|