Defense News Classifier
An LLM classifier, and the eval harness that kept me honest about it.
This labels defense-news text by category (procurement, operations, policy, technology, industry) and operational domain (air, land, sea, cyber, space, multi). That part is a single model call. The interesting part is the eval harness, because more than once it told me I was wrong, and the only correct response was to believe it over my own intuition.
What the eval caught
Two moments are the whole point of building an eval at all.
The prompt that read better and performed worse
I sharpened the system prompt to draw a cleaner line between
procurement (a purchase, the buyer's view) and
industry (a company's own business news). It read better to a
human. It measured worse: category accuracy fell from 79.0% to 76.7%, and
industry recall dropped from 0.217 to 0.100. The sharper rule
gave the model an even cleaner excuse to dump borderline company stories
into procurement. I reverted it.
The lesson: a change that sounds right can move the decision boundary the wrong way, and intuition cannot see that. Only the measurement can.
The grounding that did not pay
I added BM25 lexical retrieval to ground each classification in similar documents, expecting a clear win. Measured lift: +1.9% category accuracy, +0.0% on domain. A flip analysis showed grounding fixed about as many calls as it broke. The conclusion wrote itself: lexical retrieval does not earn the cost of moving to embeddings here.
The lesson: a negative result is still a result. Measuring first and escalating only when the eval says it pays is the discipline, and here it said do not bother.
Why the numbers can be trusted
The eval went through two versions, and the gap between them is the honest
part. v1 graded the model on 300 snippets it generated itself. Those numbers
measure consistency, not correctness, and they hid a real hole:
industry recall was 0.217, the model catching one in five. v2
replaced that with 54 hand-labeled real snippets (DoD news wire plus SEC
filings), cross-checked by an Opus judge. On real financial text with actual
vocabulary, industry went to an F1 of 1.000. Circular evals catch
whether a model is consistent. Only real, human-labeled data catches whether
it is right.
| Metric | v1 synthetic | v2 real | v2 grounded |
|---|---|---|---|
| Category accuracy | 79.0% | 88.9% | 90.7% |
| Category macro-F1 | 0.765 | 0.906 | 0.914 |
| Domain accuracy | 97.3% | 88.9% | 88.9% |
| Domain macro-F1 | 0.973 | 0.894 | 0.907 |
The domain numbers drop from v1 to v2, and that is the point. Synthetic text the model wrote was easy to grade. Real text is harder and the lower number is the trustworthy one.
The decisions
Decision · ADR-002
Force structured output with tool-use, not prompt parsing
The model is called with a forced tool schema (both fields required, strict enums), so an invalid label is rejected at the API layer before it reaches application code.
Why: the schema becomes the contract, and the app code stays free of defensive parsing. Tradeoff: tool-use costs a few extra tokens and a more verbose call signature than plain chat. Out-of-enum responses were rare (once in 300) and a single re-sample handled them.
Decision
One model call, not a multi-step pipeline
Classification is a single call, not a generate-reasoning-then-classify chain.
Why: the goal was to measure classifier quality honestly, and a simpler baseline makes the eval cleaner to reason about. Tradeoff: gives up any lift a reasoning step might add, which the eval can always be used to revisit later.
Decision · ADR-004
Implement the metrics by hand
Precision, recall, F1, and confusion matrices are computed directly with pandas and arithmetic, no ML framework.
Why: on a project about building AI
systems, writing the metrics by hand is a statement that I understand what I
am measuring, and it keeps the dependency surface to anthropic
plus pandas. Tradeoff: about 40 lines that
sklearn.metrics would replace, accepted for clarity.
Decision · SYS-005 (system-level)
In-process, idempotent writeback to the notes API
As a component of the larger system, the classifier's Kafka consumer calls
classify() in-process and writes labels back as namespaced tags
with replace semantics, so reprocessing the same event converges instead of
piling up duplicate tags.
Why: Kafka's at-least-once delivery makes reprocessing certain, so idempotency is a requirement, not a nicety. Tradeoff: the consumer carries real retry and poison-message logic. See The System for the full loop.
Stack
Python, the Anthropic SDK (Sonnet for classification, Opus as the v2 judge), pandas for the eval tables, FastAPI for the HTTP endpoint, kafka-python for the consumer, and rank-bm25 for the grounding experiment. Tested with pytest on mocked calls, plus an opt-in testcontainers integration test that spins a real Kafka broker.
What it demonstrates
The dataset is small and the scale is personal. The transferable part is the method: evals are the unit tests of an LLM system, a negative result is a finding, and measurement outranks intuition every time the two disagree. That is the habit that matters when the model, the data, or the stakes get bigger.