Defense News Classifier

An LLM classifier, and the eval harness that kept me honest about it.

This labels defense-news text by category (procurement, operations, policy, technology, industry) and operational domain (air, land, sea, cyber, space, multi). That part is a single model call. The interesting part is the eval harness, because more than once it told me I was wrong, and the only correct response was to believe it over my own intuition.

What the eval caught

Two moments are the whole point of building an eval at all.

The prompt that read better and performed worse

I sharpened the system prompt to draw a cleaner line between procurement (a purchase, the buyer's view) and industry (a company's own business news). It read better to a human. It measured worse: category accuracy fell from 79.0% to 76.7%, and industry recall dropped from 0.217 to 0.100. The sharper rule gave the model an even cleaner excuse to dump borderline company stories into procurement. I reverted it.

The lesson: a change that sounds right can move the decision boundary the wrong way, and intuition cannot see that. Only the measurement can.

The grounding that did not pay

I added BM25 lexical retrieval to ground each classification in similar documents, expecting a clear win. Measured lift: +1.9% category accuracy, +0.0% on domain. A flip analysis showed grounding fixed about as many calls as it broke. The conclusion wrote itself: lexical retrieval does not earn the cost of moving to embeddings here.

The lesson: a negative result is still a result. Measuring first and escalating only when the eval says it pays is the discipline, and here it said do not bother.

Why the numbers can be trusted

The eval went through two versions, and the gap between them is the honest part. v1 graded the model on 300 snippets it generated itself. Those numbers measure consistency, not correctness, and they hid a real hole: industry recall was 0.217, the model catching one in five. v2 replaced that with 54 hand-labeled real snippets (DoD news wire plus SEC filings), cross-checked by an Opus judge. On real financial text with actual vocabulary, industry went to an F1 of 1.000. Circular evals catch whether a model is consistent. Only real, human-labeled data catches whether it is right.

Metric	v1 synthetic	v2 real	v2 grounded
Category accuracy	79.0%	88.9%	90.7%
Category macro-F1	0.765	0.906	0.914
Domain accuracy	97.3%	88.9%	88.9%
Domain macro-F1	0.973	0.894	0.907

The domain numbers drop from v1 to v2, and that is the point. Synthetic text the model wrote was easy to grade. Real text is harder and the lower number is the trustworthy one.

The decisions

Decision · ADR-002

Force structured output with tool-use, not prompt parsing

The model is called with a forced tool schema (both fields required, strict enums), so an invalid label is rejected at the API layer before it reaches application code.

Why: the schema becomes the contract, and the app code stays free of defensive parsing. Tradeoff: tool-use costs a few extra tokens and a more verbose call signature than plain chat. Out-of-enum responses were rare (once in 300) and a single re-sample handled them.

Decision

One model call, not a multi-step pipeline

Classification is a single call, not a generate-reasoning-then-classify chain.

Why: the goal was to measure classifier quality honestly, and a simpler baseline makes the eval cleaner to reason about. Tradeoff: gives up any lift a reasoning step might add, which the eval can always be used to revisit later.

Decision · ADR-004

Implement the metrics by hand

Precision, recall, F1, and confusion matrices are computed directly with pandas and arithmetic, no ML framework.

Why: on a project about building AI systems, writing the metrics by hand is a statement that I understand what I am measuring, and it keeps the dependency surface to anthropic plus pandas. Tradeoff: about 40 lines that sklearn.metrics would replace, accepted for clarity.

Decision · SYS-005 (system-level)

In-process, idempotent writeback to the notes API

As a component of the larger system, the classifier's Kafka consumer calls classify() in-process and writes labels back as namespaced tags with replace semantics, so reprocessing the same event converges instead of piling up duplicate tags.

Why: Kafka's at-least-once delivery makes reprocessing certain, so idempotency is a requirement, not a nicety. Tradeoff: the consumer carries real retry and poison-message logic. See The System for the full loop.

Stack

Python, the Anthropic SDK (Sonnet for classification, Opus as the v2 judge), pandas for the eval tables, FastAPI for the HTTP endpoint, kafka-python for the consumer, and rank-bm25 for the grounding experiment. Tested with pytest on mocked calls, plus an opt-in testcontainers integration test that spins a real Kafka broker.

What it demonstrates

The dataset is small and the scale is personal. The transferable part is the method: evals are the unit tests of an LLM system, a negative result is a finding, and measurement outranks intuition every time the two disagree. That is the habit that matters when the model, the data, or the stakes get bigger.