essay 2026-03-24

The 870-Token Fix

A free text file beat upgrading to a model that costs four times as much

AI Collaboration

A categorization tool sorts incoming ideas into seven custom categories. The cheap model dumps half of them into “miscellaneous.” The obvious fix: upgrade to a model four times the price. The actual fix: an 870-token text file. Zero dollars. Thirty minutes.

The problem

The app takes messy, voice-transcribed ideas and assigns each to one of seven domain-specific categories. Not generic labels like “work” or “personal” — buckets with names that only make sense if you know the system. Categories like “content pipeline” or “infrastructure” where the boundaries are obvious to me but opaque to a model seeing them cold.

Four models, twenty labeled ideas, bare category names only:

Model	Match	Failure mode
Sonnet	80%	Reasonable disagreements
Maverick	70%	One real error
Qwen 3.5+	60%	Dumps to misc
Haiku	50%	Aggressively dumps to misc
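The Match column is presumably the share of ideas where a model's category equals the stored label, and "dumps to misc" is the share sent to the catch-all. A minimal sketch of that scoring follows; the function names and the four-item sample are hypothetical, not the actual harness or data.

```python
from collections import Counter


def match_rate(predicted, labels):
    """Fraction of ideas where the predicted category equals the stored label."""
    if len(predicted) != len(labels):
        raise ValueError("predicted and labels must be the same length")
    hits = sum(p == t for p, t in zip(predicted, labels))
    return hits / len(labels)


def misc_rate(predicted, misc="miscellaneous"):
    """Fraction of ideas the model sent to the catch-all category."""
    return Counter(predicted)[misc] / len(predicted)


# Hypothetical four-item sample, not the real twenty ideas.
labels = ["content pipeline", "infrastructure", "content pipeline", "infrastructure"]
predicted = ["content pipeline", "miscellaneous", "miscellaneous", "infrastructure"]

print(match_rate(predicted, labels))  # 0.5
print(misc_rate(predicted))           # 0.5
```

Tracking the misc rate next to the match rate separates "gave up" from "guessed wrong", which is the distinction the failure-mode column draws.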

Haiku’s response, over and over:

Category: miscellaneous. I can’t confidently map this to any of the specific categories. The terms used don’t clearly align with the available options.

— Haiku, without context file

It didn’t misunderstand the ideas or hallucinate categories. It just couldn’t map unfamiliar domain terms to unfamiliar category names, so it gave up.

The naive conclusion: needs a bigger model. Wrong.

The fix

Instead of upgrading, I wrote a context file. 870 tokens. Three things:

Richer descriptions. Two to three lines per category instead of one. Enough specificity to distinguish “this belongs here” from “this could go anywhere.”
Domain glossary. Twenty-five terms mapped to correct categories. Project names, tool names, abbreviations that appear constantly but mean nothing without context.
Transcription quirk notes. Voice transcription garbles proper nouns predictably. “11 laps” means ElevenLabs. I know this. The model doesn’t — until you tell it.

The framing that always works: what would I tell a competent human doing this for the first time?
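The three ingredients can be assembled into a single text block that is prepended to every idea. The sketch below is one possible layout, not the actual file: the category descriptions are placeholders, the glossary entry is an assumed example, and only the "11 laps" quirk comes straight from the notes above.

```python
def build_context(categories, glossary, quirks):
    """Assemble a plain-text context file from the three ingredients."""
    lines = ["CATEGORIES"]
    for name, description in categories.items():
        lines.append(f"- {name}: {description}")
    lines.append("")
    lines.append("GLOSSARY (term -> category)")
    for term, category in glossary.items():
        lines.append(f"- {term} -> {category}")
    lines.append("")
    lines.append("TRANSCRIPTION QUIRKS")
    for heard, meant in quirks.items():
        lines.append(f'- "{heard}" means {meant}')
    return "\n".join(lines)


def build_prompt(context, idea):
    """Prepend the context file to the idea being classified."""
    return f"{context}\n\nIDEA\n{idea}\n\nReply with exactly one category name."


# Placeholder descriptions; the real file uses two to three lines per category.
categories = {
    "content pipeline": "Producing and publishing episodes, scripts, and audio.",
    "infrastructure": "Servers, deployment, and tooling that keep things running.",
}
glossary = {"PärPod": "content pipeline"}  # assumed example entry
quirks = {"11 laps": "ElevenLabs"}

context = build_context(categories, glossary, quirks)
prompt = build_prompt(context, "render the next episode with 11 laps")
print(prompt)
```

Keeping the file as data plus a small builder makes it easy to add a glossary term whenever a new misclassification turns up.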

The results

Same four models, same twenty ideas, 870-token context file prepended:

Model + context	Match	Change
Sonnet	~95%	From 80% to ~95%, still best, but the gap to the rest shrank
Haiku	~80%	From 50% to ~80%, misc dumping eliminated
Maverick	~80%	From 70% to ~80%
Qwen 3.5+	~80%	From 60% to ~80%

Haiku went from worst to tied for second. The spread between best and worst collapsed from thirty points to fifteen. Three models converged at the same level.
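The spread and the per-model gains follow directly from the two tables. A quick check of the arithmetic, using the reported figures (the "after" values are the approximate ones from the second table):

```python
# Match rates in percent, copied from the two tables.
before = {"Sonnet": 80, "Maverick": 70, "Qwen 3.5+": 60, "Haiku": 50}
after = {"Sonnet": 95, "Maverick": 80, "Qwen 3.5+": 80, "Haiku": 80}


def spread(scores):
    """Gap between the best and worst model, in percentage points."""
    return max(scores.values()) - min(scores.values())


gains = {model: after[model] - before[model] for model in before}

print(spread(before))  # 30
print(spread(after))   # 15
print(gains)  # {'Sonnet': 15, 'Maverick': 10, 'Qwen 3.5+': 20, 'Haiku': 30}
```

The weakest model gained the most, which is what you would expect if the missing ingredient was domain knowledge rather than reasoning ability.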

Same model, same idea, with the context file:

Category: content pipeline. This mentions PärPod episode generation and TTS rendering, which falls squarely within the content production workflow.

— Haiku, with 870-token glossary

The context didn’t make the models smarter. It made the task easier.

The twist

Four to five of the twenty “ground truth” labels were wrong. They’d been assigned by Sonnet in an earlier run — without the context file. Same domain-knowledge mistakes the cheaper models made later.

With context, all four models agreed on the correct classification for those items, contradicting the stored labels. The context didn’t just help cheap models catch up. It exposed errors the expensive model had made flying blind.

The ground truth was contaminated by the same gap. When you evaluate against labels generated without context, you’re measuring which model best replicates the original model’s mistakes.
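One way to surface contaminated labels is to flag every item where all models agree with each other but not with the stored label. This is a sketch of that check under assumed data shapes (one list of predictions per model, aligned with the stored labels); the sample values are invented for illustration.

```python
def suspect_labels(stored_labels, predictions_by_model):
    """Return (index, stored, consensus) for items where every model agrees
    on one category and that category differs from the stored label."""
    suspects = []
    for i, stored in enumerate(stored_labels):
        votes = {preds[i] for preds in predictions_by_model.values()}
        if len(votes) == 1 and stored not in votes:
            suspects.append((i, stored, next(iter(votes))))
    return suspects


# Hypothetical three-item sample.
stored = ["infrastructure", "content pipeline", "miscellaneous"]
predictions = {
    "Sonnet": ["infrastructure", "content pipeline", "content pipeline"],
    "Haiku": ["infrastructure", "content pipeline", "content pipeline"],
    "Maverick": ["infrastructure", "infrastructure", "content pipeline"],
    "Qwen 3.5+": ["infrastructure", "content pipeline", "content pipeline"],
}

print(suspect_labels(stored, predictions))
# [(2, 'miscellaneous', 'content pipeline')]
```

Unanimous disagreement is a prompt for a human to re-check the label, not proof that the label is wrong: models sharing the same context file can also share the same mistake.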

When model size still matters

A context file isn’t a universal fix. The remaining fifteen-point gap between Haiku and Sonnet is real, and model size still matters for:

Creative generation — voice, style, humor. Smaller models plateau regardless of context.
Ambiguity reasoning — genuine uncertainty, not just term mapping.
Long-tail edge cases — novel inputs that need broad reasoning.

For classification, the gap often doesn’t matter. 80% at a quarter of the cost is the right trade when the remaining 20% can be caught by a confidence threshold.
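A confidence threshold can be sketched as a simple router: accept confident classifications from the cheap model and escalate the rest to a larger model or a human. How the confidence score is obtained is left open here (it is an assumption that the pipeline has one), and the 0.7 cutoff is an arbitrary placeholder to tune against labeled data.

```python
from dataclasses import dataclass


@dataclass
class Classification:
    idea: str
    category: str
    confidence: float  # assumed to be in [0.0, 1.0], however the pipeline produces it


def route(results, threshold=0.7):
    """Split results into those accepted as-is and those escalated for review."""
    accepted = [r for r in results if r.confidence >= threshold]
    escalated = [r for r in results if r.confidence < threshold]
    return accepted, escalated


# Hypothetical results from the cheap model.
results = [
    Classification("idea A", "content pipeline", 0.92),
    Classification("idea B", "infrastructure", 0.41),
    Classification("idea C", "content pipeline", 0.70),
]

accepted, escalated = route(results, threshold=0.7)
print([r.idea for r in accepted])   # ['idea A', 'idea C']
print([r.idea for r in escalated])  # ['idea B']
```

The trade only holds if the escalated share stays small; if most items fall below the threshold, the savings from the cheaper model disappear.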

Before upgrading the model, upgrade the prompt. Not with clever tricks — just with the domain knowledge the task requires. Write down what you know that the model doesn’t.

870 tokens. Thirty minutes. Zero dollars. Always try it first.