Product & Program
The same system, seen from above. How I frame it as a product and run it as a program, with the limits named.
The other two writeups are about building the system. This one is about framing and running it: who it is for, how success is measured, how the work is sequenced, and what could go wrong. One thing stated plainly upfront, because it is the first thing a sharp reader will think: this is a solo build, so the program layer is simulated. There is no real cross-team coordination to manage. The workstreams, dependencies, sequencing, and risk reasoning are real and transferable. The org around them is not. Naming that is more credible than pretending otherwise.
The problem (product framing)
Defense developments are scattered across many sources and arrive faster than one person can triage. An analyst spends a disproportionate share of time finding and sorting items (judging relevance and assigning a category) before any real analysis begins. No lightweight, personal system ingests the firehose, classifies it by topic, and makes the result queryable, so the analyst starts from a raw feed instead of a triaged knowledge base.
Who it is for
Primary user: a defense-news analyst who must stay current across programs, contracts, geopolitics, and technology, and brief others on what matters. This is a notional persona, and defense is a deliberately chosen vertical, not a claim of domain expertise. Design partner: me, dogfooding, as the first user and the early validation loop.
Job to be done
When a stream of defense news comes in, help me quickly know what matters and where it fits, so I spend my time analyzing instead of sorting.
Success metrics
Split deliberately into two kinds, because they answer different questions.
User-outcome metrics (does it help the analyst?)
- Auto-triage rate: share of incoming items correctly categorized automatically, so the analyst never hand-sorts them.
- Coverage (recall): share of genuinely relevant developments surfaced rather than missed.
- Answer usefulness: for a query, the agent returns a correct, cited answer, fast.
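The first two metrics above reduce to simple ratios. A minimal sketch of how they might be computed, assuming each item carries a human judgment alongside the system's output; the `Item` record and its field names are hypothetical, not taken from the project:

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Item:
    """One incoming news item (hypothetical record shape)."""
    item_id: str
    relevant: bool                 # human judgment: should the analyst see this?
    surfaced: bool                 # did the system surface it?
    auto_category: Optional[str]   # category assigned automatically, if any
    true_category: Optional[str]   # category assigned by a human


def auto_triage_rate(items: Sequence[Item]) -> float:
    """Share of all incoming items whose automatic category matches the human one."""
    if not items:
        return 0.0
    correct = sum(
        1
        for it in items
        if it.auto_category is not None and it.auto_category == it.true_category
    )
    return correct / len(items)


def coverage(items: Sequence[Item]) -> float:
    """Recall: share of genuinely relevant items that were surfaced."""
    relevant = [it for it in items if it.relevant]
    if not relevant:
        return 0.0
    return sum(1 for it in relevant if it.surfaced) / len(relevant)
```

The denominators differ on purpose: auto-triage rate is measured over everything that arrives, while coverage is measured only over what was worth seeing.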
AI-quality metrics (can it be measured, not just demoed?)
- Classification quality: per-field precision, recall, F1. On v2 (real, human-labeled text), category scores 88.9% (macro-F1 0.906) and operational domain 88.9% (macro-F1 0.894). The ceiling is label ambiguity, not model horsepower.
- Retrieval quality (RAG): recall@k, answer groundedness, citation correctness.
- Regression gate: an eval pass-rate threshold wired into CI, so a change that drops below it does not merge.
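The three AI-quality metrics above can be sketched in a few lines of standard-library Python. This is an illustration of the standard definitions, not the project's eval harness: the function names, the input shapes, and the 0.75 threshold in the usage note are all assumptions.

```python
from collections import Counter
from typing import Sequence, Set


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every class seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    classes = set(y_true) | set(y_pred)
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]  # F1 = 2TP / (2TP + FP + FN)
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top-k retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def gate_passes(case_results: Sequence[bool], threshold: float) -> bool:
    """True when the eval pass rate meets the threshold; an empty run fails."""
    if not case_results:
        return False
    return sum(case_results) / len(case_results) >= threshold
```

In a CI job, a small script could call `gate_passes(results, 0.75)` and exit non-zero on `False`, which is what blocks the merge. Treating an empty run as a failure is a deliberate choice in this sketch, so that a broken eval loader cannot pass silently.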
Roadmap (program framing)
Sequenced Now / Next / Later, harvested from real delivery rather than wishful planning.
Now
- Product one-pager and this program view
- Agent tool-layer contract (SYS-003), accepted and implemented
- Evals-as-CI running across all three code repos
Next
- Event loop closed: publish, consume, idempotent writeback
- Integration tests on both seams (Testcontainers-Kafka)
- A weekly status cadence from real progress
Later
- Containerize, then Kafka on K8s (Strimzi)
- OpenTelemetry observability across services
- Other verticals: an explicit non-goal, articulated not built
Risk register
| Risk | Severity | Mitigation |
|---|---|---|
| Duplicate event processing (at-least-once redelivery) | Low | Idempotent namespaced-tag writeback; commit the offset only after success |
| Classifier accuracy ceiling (label ambiguity, not the model) | Medium | Do not escalate the model; refine the taxonomy or judge boundary cases |
| Breadth creep (verticals without depth) | Medium | Go deep on one vertical; other verticals are an explicit non-goal |
| Planning theater (docs drift from delivery) | Medium | Keep artifacts thin and living, tied to real delivery only |
| Silent contract drift (separate repos) | Medium | Frozen wire contracts plus contract tests on both sides in CI |
| Simulated program (no real cross-team coordination) | Low (honesty) | Stated plainly; the reasoning and artifacts are real, the org is not |
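The mitigation for duplicate event processing can be illustrated without a broker. The sketch below uses in-memory stand-ins and assumes that a "namespaced tag" means one key per namespace and field, so that rewriting the same classification overwrites the earlier value instead of appending to it. `Event`, `TagStore`, `process`, and the `classifier` namespace are hypothetical names, not the project's actual types or any Kafka client API.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass(frozen=True)
class Event:
    offset: int      # position in the stream, as a broker would assign it
    item_id: str
    category: str


@dataclass
class TagStore:
    """In-memory stand-in for whatever store receives the writeback."""
    tags: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def write(self, item_id: str, key: str, value: str) -> None:
        # Keyed overwrite: applying the same write twice leaves the same state.
        self.tags.setdefault(item_id, {})[key] = value


def process(events: Iterable[Event], store: TagStore, namespace: str = "classifier") -> int:
    """Apply events in order and return the last offset that is safe to commit.

    Returns -1 when nothing was written successfully.
    """
    committed = -1
    for event in events:
        try:
            store.write(event.item_id, f"{namespace}/category", event.category)
        except Exception:
            # Stop without advancing the offset so the event is redelivered.
            break
        committed = event.offset  # advance only after the write succeeded
    return committed
```

Because the write is a keyed overwrite, replaying the same batch after a redelivery leaves the store unchanged, which is why at-least-once delivery is tolerable here. Committing only after success means a failed write costs a retry, not a lost classification.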
What it demonstrates
The ability to put the product and program hats on over real engineering work: to name the user and the job, to measure outcomes in two registers, to sequence dependent workstreams, and to track risk honestly, including the risk of the exercise itself. The scale is personal and the program is simulated, and both are named on purpose. The source documents live in the architecture repo: the product one-pager and the program view.