Skip to main content

max / makenotwork

18.1 KB · 214 lines History Blame Raw
1 # Sando observability architecture
2
3 Status: draft, 2026-06-03. Argument shape, not a checklist.
4
5 Goal: an error and visibility surface built on newtypes. Sando's current daemon and TUI agree by string convention — `tier: String`, `version: String`, `gate: String`, `detail: Option<String>` — and every consumer (schema, WS payloads, TUI render) reparses or pattern-matches those strings independently. The result is that gate failure classification only exists as prose in `gate_runs.detail`, and the TUI's only recourse on red is "open `/logs/<version>/<gate>` and read the tail." This document proposes the type graph that replaces the strings, the boundary parses that mint the types, and the persistence + wire shapes that follow.
6
7 This is one cohesive design. The launchplan splits it into "Phase B (typed GateOutcome)" and "Phase C (live tail + remaining newtypes)"; that split is a delivery convenience, not an architectural seam. The migration order below preserves the option to ship Phase B first, but the types it introduces must already be the final shape, not a string stand-in to be newtyped later.
8
9 ## What is currently stringly-typed
10
11 Inventory from a read of `daemon/src/gates.rs`, `daemon/src/events.rs`, `daemon/src/topology.rs`, `daemon/migrations/001_init.sql`, and `tui/src/main.rs:186-266`:
12
13 | Concept | Current shape | Where it appears |
14 |---|---|---|
15 | Tier | `tier: String` | `GateCtx`, every `Event` variant, `tiers.name`, `gate_runs.tier`, `tier_state.tier`, `deploys.tier`, TUI `str_v(v.get("tier"))` |
16 | Node | `node: String` (sometimes `Option`) | `Tier.nodes[].name`, `deploys.node`, `Event::Deploy*` |
17 | Version | `version: String` (semver `0.8.12`) | `GateCtx`, `versions.version`, every event, TUI display |
18 | Git sha | `sha: String` (short or full?) | `versions.git_sha`, `Event::RebuildRequested/BuildStart/BuildOk/BuildFailed/BuildAborted`, `/logs/<sha>/<gate>` per launchplan §3 |
19 | Gate kind | `Gate` enum *inside the daemon*; `gate_kind: &'static str` once it crosses the events/schema boundary; reparsed in TUI | `gates::kind_str`, `gate_runs.gate_kind`, `Event::Gate*.gate`, TUI `match kind` |
20 | Gate outcome detail | `detail: Option<String>` | `GateOutcome.detail`, `gate_runs.detail`, conflates: config error, missing prerequisite, real failure with stderr tail, burn-in clock progress, "waiting on operator" |
21 | Deploy outcome | `outcome TEXT DEFAULT 'in_progress'` | `deploys.outcome`; no enum on the Rust side |
22 | Deploy failure | `error: String` | `Event::DeployFailed.error`; freeform |
23 | Backup source | `source: String` | `backups.source`, `Event::BackupFetched.source`; an ssh URL today |
24 | Canary policy | typed (`CanaryPolicy`) at config load, then `as_str()` to a `TEXT` column | `tiers.canary` |
25
26 `gates::kind_str` and the TUI's `match kind` are the clearest tell — we have an enum on each side of the wire and a string in the middle. That string is the only schema, so adding a gate kind is a four-place edit with no compile-time enforcement.
27
28 The same anti-pattern with higher stakes is `detail`. Three concrete examples from `gates.rs`:
29
30 - `migration_dry_run` writes `"scratch_db_url unset in daemon config"` (a config bug, no point retrying).
31 - `migration_dry_run` writes `"no backup fetched; call /backup/fetch first"` (a missing prerequisite, retry after `/backup/fetch`).
32 - `boot_smoke` writes `"binary exited early: exit status: 101\n==== stdout ====\n…"` (a real failure carrying a tail).
33 - `burn_in` writes `"47 hours remaining of 168"` (not a failure at all — the gate is correctly red but the deploy is not blocked by a defect).
34
35 Today, an operator (or the TUI, or a future alerting rule) cannot tell those apart without substring-matching the prose. The point of the redesign is to make that classification a property of the value, not a property of how it's spelled.
36
37 ## The type graph
38
39 Domain types live in a new `daemon/src/domain.rs` (name negotiable). They are the vocabulary every other module speaks; they implement `Serialize`/`Deserialize`, `Display`, `FromStr`, and `sqlx::Type` so they round-trip through events, JSON responses, and SQLite columns without per-site conversion.
40
41 ```text
42 ┌──────────────┐
43 │ TierId │ newtype String, validated against topology on construction
44 └──────────────┘
45 ┌──────────────┐
46 │ NodeId │ newtype String, validated against TierId's nodes
47 └──────────────┘
48 ┌──────────────┐
49 │ Version │ parsed semver (semver::Version), Display = "0.8.12"
50 └──────────────┘
51 ┌──────────────┐
52 │ GitSha │ enforced 40-hex or short-7-hex; one canonical form
53 └──────────────┘
54 ┌──────────────┐
55 │ GateKind │ enum, replaces both `Gate` discriminant and `gate_kind` string
56 │ │ Variants: CargoTest, MigrationDryRun, BootSmoke,
57 │ │ BurnIn, ManualConfirm. Display = snake_case.
58 └──────────────┘
59 ┌──────────────┐
60 │ GateRunId │ newtype i64, identifies one row of gate_runs
61 └──────────────┘
62 ┌──────────────┐
63 │ DeployId │ newtype i64
64 └──────────────┘
65 ```
66
67 `Gate` (the topology struct with `BurnIn { hours }`) and `GateKind` (the discriminant) become distinct: `Gate` is the *config* (kind + parameters), `GateKind` is the *identifier* you use to talk about a class of gate in events and the schema. `gate_runs.gate_kind` stores `GateKind`; gate parameters at the time of the run, if we ever need them in history, become a separate column.
68
69 ### `GateOutcome` redesign
70
71 ```rust
72 pub struct GateOutcome {
73 pub status: GateStatus,
74 pub log_ref: Option<LogRef>, // pointer to persisted stdout/stderr, not the tail itself
75 }
76
77 pub enum GateStatus {
78 Passed { note: PassNote },
79 Failed(GateFailure),
80 Blocked(GateBlocker), // gate cannot run yet; not a defect
81 }
82
83 pub enum PassNote {
84 StayedUp { duration_s: u32 }, // boot_smoke
85 BurnInElapsed { hours: u32 }, // burn_in
86 Migrated { backup: BackupId, count: u32 }, // migration_dry_run
87 TestsPassed { count: u32, duration_s: u32 }, // cargo_test
88 OperatorConfirmed { at: DateTime<Utc> }, // manual_confirm
89 }
90
91 pub enum GateBlocker {
92 BurnInClockNotStarted,
93 BurnInRemaining { hours: u32, total: u32 },
94 AwaitingOperatorConfirmation,
95 NoBackupAvailable,
96 ScratchDbUrlUnset,
97 ArtifactMissing { version: Version },
98 }
99
100 pub enum GateFailure {
101 CargoTest { failed_count: u32, first_failed: Option<String> },
102 MigrationDrift { migration: String }, // "previously applied but missing"
103 MigrationModified { migration: String }, // "previously applied but modified"
104 MigrationSqlError { migration: String, sqlstate: Option<String> },
105 RestoreFailed { kind: RestoreFailureKind },
106 BootPanic { exit_code: Option<i32> },
107 BootExitedEarly { exit_code: Option<i32> },
108 SpawnFailed { os_error: i32 },
109 Timeout { gate: GateKind, after_s: u32 },
110 Unclassified, // fallthrough; log_ref required
111 }
112 ```
113
114 Three things to notice about this shape:
115
116 1. **`Blocked` is its own variant.** Burn-in not yet elapsed and "scratch_db_url unset" are not failures — they are pre-conditions the operator can address out-of-band. The TUI can render them yellow instead of red. Today they're red, indistinguishably.
117 2. **`log_ref`, not `detail`.** The structured variants carry just enough to render a one-line summary (`migration 0047 modified`, `tests failed: 3`). The full tail lives on disk and the variant carries a pointer to it. This is the architectural seam between Phase B (classification) and Phase C (live tail) — once `log_ref` exists, Phase C is "ship chunks of that log over WS as they're written" rather than a redesign.
118 3. **`Unclassified` is admitted.** Classifiers are best-effort and the migration plan must work when a new failure mode shows up that no classifier matches yet. Unclassified failures still have a `log_ref`; they degrade gracefully to the current "read the tail" experience without breaking the contract.
119
120 ### Classifier layer
121
122 Each gate's runner produces raw output (`stdout`, `stderr`, exit status, plus any structured signals like a sqlx error). A classifier maps `(GateKind, RawOutput) -> GateStatus`. Classifiers are pure functions and live in `daemon/src/classify/` — one file per gate. The taxonomy in `plans/migration-dryrun-failures.md` is already 80% of `migration_dry_run`'s classifier; it gets ported into code, not just docs.
123
124 Classifiers can be unit-tested with captured fixtures (`tests/fixtures/cargo_test_failed_compile.txt``GateStatus::Failed(GateFailure::CargoTest { failed_count: 0, .. })`). This is the first place Sando gets meaningful test coverage for diagnostic behavior, which `todo.md` flags as a gap.
125
126 ### Where each type enters Sando
127
128 Boundary discipline: every newtype is constructed once, at the edge where its string form enters the process. Internally, only the typed form moves.
129
130 - `TierId`, `NodeId` — at `Topology::load`. The validator already walks tiers; it now mints `TierId`s and rejects unknown references at parse time instead of letting them surface as foreign-key failures.
131 - `Version` — at the build step (`build.rs`), parsed from the server `Cargo.toml`. Stored as text in `versions.version`; the column round-trips through `sqlx::Type for Version`.
132 - `GitSha` — at the post-receive hook entry point in `routes.rs::rebuild`. The hook today passes a string; the route normalizes to `GitSha` immediately. The `/logs/<sha>/<gate>` route (launchplan §3 wording) becomes `/logs/<GitSha>/<GateKind>`, but on-disk storage continues to key by `Version` until the build step has both available together. The comment at `gates.rs:432-434` already anticipates this transition.
133 - `GateKind` — at `Topology` deserialization. `Gate` keeps its parameter-carrying variants; `GateKind` is derived from `Gate` and used everywhere else.
134 - `GateOutcome` — produced by gate runners, persisted, emitted in events, displayed in TUI. Never lowered to a `String` along the way.
135
136 ## Persistence
137
138 `gate_runs` becomes:
139
140 ```sql
141 CREATE TABLE gate_runs (
142 id INTEGER PRIMARY KEY AUTOINCREMENT,
143 version TEXT NOT NULL REFERENCES versions(version),
144 tier TEXT NOT NULL REFERENCES tiers(name),
145 gate_kind TEXT NOT NULL, -- GateKind, Display form
146 started_at TEXT NOT NULL,
147 finished_at TEXT,
148 status TEXT, -- 'passed' | 'failed' | 'blocked' | NULL while in-flight
149 outcome_json TEXT, -- serialized GateOutcome (PassNote / GateFailure / GateBlocker)
150 log_ref TEXT -- relative path under cfg.logs_root, or NULL
151 );
152 ```
153
154 `status` is denormalized for cheap indexing and `WHERE status = 'failed'` queries. `outcome_json` is the source of truth and is what the daemon reads back when serving `/state`. The migration drops `passed INTEGER` and `detail TEXT`, with a backfill that maps:
155
156 - `passed = 1``status = 'passed'`, `outcome_json = {"kind":"passed","note":{"kind":"legacy","text":<old detail>}}`
157 - `passed = 0`, detail matches a known prefix (`"burn-in"`, `"scratch_db_url unset"`, `"no backup fetched"`, `"waiting on operator"`) → `status = 'blocked'`, appropriate `GateBlocker` variant
158 - `passed = 0` otherwise → `status = 'failed'`, `outcome_json = {"kind":"failed","failure":{"kind":"unclassified","legacy_detail":<old>}}`
159
160 Backfill correctness is testable against the existing prod sandod sqlite — there are not many rows.
161
162 `deploys.outcome` gets the same treatment in a smaller, separate migration: enum-typed (`InProgress | Ok | Failed { kind: DeployFailureKind }`), with the failure kind classifying spawn/transport/health-check distinctly from the freeform `error: String` carried by `Event::DeployFailed` today.
163
164 ## Wire shape (events + `/state`)
165
166 `Event::GateDone` becomes:
167
168 ```rust
169 GateDone {
170 tier: TierId,
171 version: Version,
172 gate: GateKind,
173 outcome: GateOutcome,
174 }
175 ```
176
177 The old `passed: bool` is gone — `outcome.status` carries strictly more information. The TUI's `format_event` becomes a match on the typed envelope (deserialized via `serde_json::from_str::<EventEnvelope>`, not `Value` reflection), and the per-kind render functions in `tui/src/main.rs:186-266` collapse into `Display` impls on the domain types. The `str_v` / `num_v` helpers stop being needed.
178
179 `/state` (the TUI's polling endpoint) gains a `gates: Vec<GateRunSummary>` per tier, where `GateRunSummary` is the typed outcome plus timestamps and a `log_ref`. The TUI no longer needs `/logs/...` to populate its primary view; `/logs` becomes the drill-down for unclassified failures only.
180
181 Phase C's live tail rides on the same `log_ref`: while a gate is in flight, `log_ref` is present and the WS emits `GateLogChunk { run_id: GateRunId, bytes: Vec<u8>, seq: u32 }` events as the runner flushes to disk. The TUI keeps a per-run ring buffer keyed by `GateRunId`. Because `GateRunId` is a newtype with a clear identity, the wire protocol does not need to invent a separate stream identifier.
182
183 ## Migration order
184
185 The order is constrained by what compiles together, not by the launchplan's B-then-C framing.
186
187 1. **Domain types module** — introduce `TierId`, `NodeId`, `Version`, `GitSha`, `GateKind`, `GateRunId`, `DeployId` with `Serialize` / `sqlx::Type` impls. No call sites change yet. Pure addition.
188 2. **Topology + config use the types**`Topology::load` mints `TierId` / `NodeId`; `GateCtx` carries `TierId`, `Version`, not strings. `gate_runs.tier` and `gate_runs.version` columns stay `TEXT` (no schema migration); the change is on the Rust side only. `kind_str` deletes.
189 3. **Events use the types** — every `Event` variant takes the newtypes. WS frames are unchanged on the wire (snake-case strings) because `Serialize` impls produce identical JSON. The TUI keeps its `Value` reflection during this step; nothing forces it to break.
190 4. **`GateOutcome` redesign + classifier layer + `gate_runs` migration** — the biggest single step. Runners produce typed outcomes; classifiers live in `daemon/src/classify/`; schema migration 003 lands with backfill. `/state` exposes typed outcomes. TUI keeps `Value` reflection but is now reading typed fields (`status`, `outcome_json`).
191 5. **TUI typed event handling** — deserialize `EventEnvelope` directly, drop `str_v` / `num_v`, render via `Display` on domain types. Yellow for `Blocked`, red for `Failed`, green for `Passed`.
192 6. **Live tail (`GateLogChunk` events, `log_ref` plumbing)** — gates write to a per-run log file as they run, not at completion; runners drop the `Vec<u8>` buffers in `boot_smoke` / `cargo_test` in favor of streaming through a `tokio::fs::File` writer that also broadcasts chunks. TUI consumes.
193 7. **Deploy outcome typing + `DeployFailed` failure-kind classifier** — same pattern, smaller scope. Closes out the rest of the string surface.
194
195 Steps 1–3 can ship in one commit per step without changing observable behavior. Step 4 is where the operator visibility actually changes; it's the smallest unit that justifies a version bump and a sando deploy to itself.
196
197 ## Non-goals and open questions
198
199 - **Not aiming for a generic gate framework.** The `GateFailure` variants are MNW-specific (cargo, sqlx migrations, `makenotwork` binary boot). If we ever gate non-MNW projects, this enum will need to factor; until then, concrete variants are clearer than `Box<dyn>` traits.
200 - **Not introducing `Result<GateOutcome, GateError>`.** Runner-internal failures (sqlite I/O, can't spawn) collapse into `GateStatus::Failed(GateFailure::SpawnFailed | …)` or `GateBlocker` as appropriate. The outer `Result` only exists for genuine "the daemon itself is broken" cases that prevent persistence at all.
201 - **Open: backfill aggressiveness.** Should we run classifiers retroactively against the historical `detail` strings during the schema-3 backfill, or only forward? Forward-only is simpler and the history isn't operator-load-bearing — leaning toward forward-only with the `unclassified + legacy_detail` envelope above, but worth deciding.
202 - **Open: where does `GateRunId` originate?** Today the `INSERT … RETURNING id` returns a raw i64 inside `gates::run`. Cleanest is for `gates::run` to mint a `GateRunId` and pass it back; the event bus then carries it from `GateStart` through `GateLogChunk` through `GateDone`, making client-side correlation trivial. Worth confirming this doesn't break the `manual_confirm` lookup pattern in `gates.rs:386-403`.
203 - **Open: `serde(tag = "kind")` vs explicit variant shape on `GateOutcome`.** The events module already uses `#[serde(tag = "kind", rename_all = "snake_case")]`, which would give us `{"kind":"failed","failure":{"kind":"migration_drift","migration":"0047"}}`. Two layers of tagged unions is verbose but consistent. Alternative: flatten via `#[serde(flatten)]` and accept that the JSON shape diverges slightly from the Rust shape. Leaning toward the explicit two-layer form because the TUI's parser then mirrors the Rust enum 1:1.
204 - **Open: do we keep `detail TEXT` for one release as a shadow column, or drop in the same migration?** Shadow is safer for rollback. Adds clutter. Probably worth one release of shadow.
205
206 ## Acceptance, when this is done
207
208 - `gates.rs` contains no `String` literals describing failure modes. Every failure path constructs a `GateFailure` variant.
209 - The TUI's `format_event` is gone; rendering goes through `Display` on domain types.
210 - An operator hitting `c /api/state` sees structured outcomes (not just `"passed": false, "detail": "..."`).
211 - A new gate kind is added by extending `GateKind`, `Gate`, and a classifier; the compiler enforces every other site.
212 - `/logs/<version>/<gate>` is a fallback for `Unclassified` outcomes, not the primary diagnostic path.
213 - Live tail works for `boot_smoke` and `cargo_test`; the buffered `Vec<u8>` pattern in `gates.rs:317-332` is gone.
214