Product & Program

The same system, seen from above. How I frame it as a product and run it as a program, with the limits named.

The other two writeups are about building the system. This one is about framing and running it: who it is for, how success is measured, how the work is sequenced, and what could go wrong. One thing stated plainly up front, because it is the first thing a sharp reader will think: this is a solo build, so the program layer is simulated. There is no real cross-team coordination to manage. The workstreams, dependencies, sequencing, and risk reasoning are real and transferable. The org around them is not. Naming that is more credible than pretending otherwise.

The problem (product framing)

Defense developments are scattered across many sources and arrive faster than one person can triage. An analyst spends a disproportionate share of time finding and sorting items (judging relevance and assigning a category) before any real analysis begins. No lightweight, personal system ingests the firehose, classifies it by topic, and makes the result queryable, so the analyst starts from a raw feed instead of a triaged knowledge base.

Who it is for

Primary user: a defense-news analyst who must stay current across programs, contracts, geopolitics, and technology, and brief others on what matters. This is a notional persona, and defense is a deliberately chosen vertical, not a claim of domain expertise. Design partner: me, dogfooding, as the first user and the early validation loop.

Job to be done

When a stream of defense news comes in, help me quickly know what matters and where it fits, so I spend my time analyzing instead of sorting.

Success metrics

Split deliberately into two kinds, because they answer different questions.

User-outcome metrics (does it help the analyst?)

Auto-triage rate: share of incoming items correctly categorized automatically, so the analyst never hand-sorts them.
Coverage (recall): share of genuinely relevant developments surfaced rather than missed.
Answer usefulness: for a query, the agent returns a correct, cited answer, fast.
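
A minimal sketch of how the first two user-outcome metrics could be computed from a batch of reviewed items. The `ReviewedItem` record and its field names are assumptions made for illustration, not the system's actual data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReviewedItem:
    """Hypothetical record: one incoming item after a human has reviewed it."""
    predicted_category: Optional[str]  # None if the system declined to categorize
    true_category: str                 # category assigned by the human reviewer
    relevant: bool                     # did the reviewer judge the item relevant?
    surfaced: bool                     # did the system show it to the analyst?


def auto_triage_rate(items: List[ReviewedItem]) -> float:
    """Share of all incoming items whose automatic category matched the human label."""
    if not items:
        return 0.0
    correct = sum(1 for it in items if it.predicted_category == it.true_category)
    return correct / len(items)


def coverage(items: List[ReviewedItem]) -> float:
    """Recall over relevant items: share of relevant developments that were surfaced."""
    relevant = [it for it in items if it.relevant]
    if not relevant:
        return 0.0
    return sum(1 for it in relevant if it.surfaced) / len(relevant)


sample = [
    ReviewedItem("contracts", "contracts", relevant=True, surfaced=True),
    ReviewedItem("technology", "geopolitics", relevant=True, surfaced=True),
    ReviewedItem(None, "programs", relevant=True, surfaced=False),
    ReviewedItem("contracts", "contracts", relevant=False, surfaced=False),
]
```

The two denominators differ on purpose: auto-triage rate is taken over everything that arrives, while coverage is taken only over items a human judged relevant.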

AI-quality metrics (can it be measured, not just demoed?)

Classification quality: per-field precision, recall, and F1. On v2 (real, human-labeled text), category scores 88.9% (macro-F1 0.906) and operational domain 88.9% (macro-F1 0.894). The ceiling is label ambiguity, not model horsepower.
Retrieval quality (RAG): recall@k, answer groundedness, citation correctness.
Regression gate: an eval pass-rate threshold wired into CI, so a change that drops below it does not merge.
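
The AI-quality metrics above can be illustrated with a small, dependency-free sketch: per-class precision/recall/F1, macro-F1, recall@k, and a gate that lists every metric below its floor. The function names and the threshold values are hypothetical; the real thresholds are not stated in this writeup.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def per_class_prf(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return {label: (precision, recall, f1)} for single-label classification."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[truth] += 1  # missed the true label
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = per_class_prf(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores) if scores else 0.0


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Sequence[str], k: int) -> float:
    """Share of the relevant documents that appear in the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def gate_failures(metrics: Dict[str, float], floors: Dict[str, float]) -> List[str]:
    """Names of metrics below their floor; an empty list means the gate passes."""
    return [name for name, floor in floors.items() if metrics.get(name, 0.0) < floor]


# Hypothetical floors, chosen only for illustration.
FLOORS = {"category_macro_f1": 0.85, "recall_at_5": 0.80}
```

In CI, a wrapper script would exit non-zero whenever `gate_failures` returns a non-empty list, which is what blocks the merge.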

Roadmap (program framing)

Sequenced Now / Next / Later, harvested from real delivery rather than wishful planning.

Now

Product one-pager and this program view
Agent tool-layer contract (SYS-003), accepted and implemented
Evals-as-CI running across all three code repos

Next

Event loop closed: publish, consume, idempotent writeback
Integration tests on both seams (Testcontainers-Kafka)
A weekly status cadence from real progress
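
The event loop named in the list above (publish, consume, idempotent writeback) can be sketched without a broker. This is an in-memory stand-in, not Kafka client code: `InMemoryLog`, `TagStore`, and the `auto/` tag namespace are all hypothetical names chosen for illustration. It shows why a redelivered event is harmless when the writeback replaces tags inside its own namespace and the offset is committed only after the write succeeds.

```python
from typing import Dict, List, Optional, Set, Tuple

AUTO_NAMESPACE = "auto/"  # assumed prefix for machine-written tags


class InMemoryLog:
    """Stand-in for one topic partition: an append-only list plus a committed offset."""

    def __init__(self) -> None:
        self.events: List[dict] = []
        self.committed = 0  # next offset to deliver

    def publish(self, event: dict) -> None:
        self.events.append(event)

    def poll(self) -> List[Tuple[int, dict]]:
        """Everything at or after the committed offset, so uncommitted work is redelivered."""
        return list(enumerate(self.events))[self.committed:]

    def commit(self, offset: int) -> None:
        self.committed = offset + 1


class TagStore:
    """Stand-in for the system that receives the classification writeback."""

    def __init__(self) -> None:
        self.tags: Dict[str, Set[str]] = {}

    def write_category(self, item_id: str, category: str) -> None:
        # Keep tags outside the namespace, replace everything inside it.
        # Applying the same event twice leaves the same final tag set.
        kept = {t for t in self.tags.get(item_id, set()) if not t.startswith(AUTO_NAMESPACE)}
        self.tags[item_id] = kept | {f"{AUTO_NAMESPACE}category/{category}"}


def consume(log: InMemoryLog, store: TagStore, crash_after_write_at: Optional[int] = None) -> None:
    """Process pending events; commit each offset only after its writeback succeeded."""
    for offset, event in log.poll():
        store.write_category(event["item_id"], event["category"])
        if offset == crash_after_write_at:
            return  # simulated crash: the write happened, the commit did not
        log.commit(offset)
```

Running `consume` once with a simulated crash and then again without one redelivers the first event, and the tag set ends up identical to a clean single pass.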

Later

Containerize, then Kafka on K8s (Strimzi)
OpenTelemetry observability across services
Other verticals: an explicit non-goal, articulated not built

Risk register

Risk	Severity	Mitigation
Duplicate event processing (at-least-once redelivery)	Low	Idempotent namespaced-tag writeback; commit the offset only after success
Classifier accuracy ceiling (label ambiguity, not the model)	Medium	Do not escalate the model; refine the taxonomy or judge boundary cases
Breadth creep (verticals without depth)	Medium	Go deep on one vertical; other verticals are an explicit non-goal
Planning theater (docs drift from delivery)	Medium	Keep artifacts thin and living, tied to real delivery only
Silent contract drift (separate repos)	Medium	Frozen wire contracts plus contract tests on both sides in CI
Simulated program (no real cross-team coordination)	Low (honesty)	Stated plainly; the reasoning and artifacts are real, the org is not
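
As one illustration of the contract-drift mitigation in the table, here is a sketch of a check that both the producing and the consuming repo could run in CI against the same frozen contract. The field names in `FROZEN_CONTRACT_V1` are invented for the example; the real wire contract is not shown in this writeup.

```python
import json
from typing import Dict, List

# Hypothetical frozen contract: field name -> required Python type after JSON decoding.
FROZEN_CONTRACT_V1: Dict[str, type] = {
    "schema_version": int,
    "item_id": str,
    "category": str,
}


def contract_violations(raw: str, contract: Dict[str, type] = FROZEN_CONTRACT_V1) -> List[str]:
    """Compare a JSON payload with the frozen contract and list every mismatch."""
    payload = json.loads(raw)
    problems = []
    for field, expected in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            problems.append(f"wrong type for field: {field}")
    for field in payload:
        if field not in contract:
            problems.append(f"unexpected field: {field}")
    return problems


def producer_encode(item_id: str, category: str) -> str:
    """Producer side: build the wire payload (illustrative only)."""
    return json.dumps({"schema_version": 1, "item_id": item_id, "category": category})
```

The producer's test asserts that `contract_violations(producer_encode(...))` is empty; the consumer's test feeds the same fixture through its own parser. A field renamed on one side then fails that side's CI instead of drifting silently.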

What it demonstrates

The ability to put the product and program hats on over real engineering work: to name the user and the job, to measure outcomes in two registers, to sequence dependent workstreams, and to track risk honestly, including the risk of the exercise itself. The scale is personal and the program is simulated, and both are named on purpose. The source documents live in the architecture repo: the product one-pager and the program view.