experiment 2026-02-23

The Fallback Ladder

What actually happens when your AI goes down and you fall back to the cheap one

Read

AI Collaboration

Every application I build has the same insurance policy: if the main AI model is down, fall back to something cheaper. If that is down too, fall back to something running on the laptop. The theory is that a slightly worse answer delivered instantly beats a perfect answer that never arrives.

The theory had never been tested. I was running fallback chains in production based on the assumption that “the cheap model is fine for this.” The assumption felt right. Feelings are not data.

So I ran the test.

The setup

Five models. Four tasks. Same inputs, blind scoring by an independent judge. The models covered the full range: the expensive cloud option, the mid-tier option, a free cloud alternative I stumbled onto, and two local models running on the laptop.

The tasks were deliberately mundane — the kind of work these models actually do in my applications: extracting facts from a transcript, reviewing a draft, summarizing text in Swedish, and generating a short editorial in Swedish. Not benchmarks. Real work.

What I expected

A clear quality ladder. Expensive model at the top. Free model noticeably worse. Local models at the bottom but acceptable. A smooth gradient from good to adequate, with the premium justified by measurable quality differences.

What actually happened

The spread was embarrassingly small.

The best model scored 9.0 out of 10. The worst scored 8.2. That is a gap of 0.8 points on a 10-point scale. For three of the four tasks, the cheapest options were within one point of the most expensive one. The free cloud model — the one I had been ignoring because it was free and free things are supposed to be worse — tied with the paid models at 8.8.

The fallback chain I had been running on faith turned out to be running on fact. The local models were genuinely acceptable. The quality loss on failover was smaller than I would have noticed without the test.

There was one exception. Swedish generation — writing original prose from scratch. Same prompt, two models:

Kommunen har genomfört en omfattande omställning av infrastrukturen i centrum. Det nya avfallshanteringssystemet har tagits emot väl av invånarna, trots tidiga farhågor om kostnadsökningar.

— Sonnet, Swedish editorial task

Kommunen har genomfört en stor förändring av infrastrukturen i centrala delarna. Det nya systemet för avfallshantering har fått positiv mottagning bland invånarna, trots initiala farhågor om ökade kostnader.

— Qwen 7B (local), same task

The local model scored 7 out of 10 here. For everything else — summarization, extraction, editorial review — it was fine. The weakness was narrow: small models struggle to generate in languages underrepresented in training data. They can process those languages competently. Creation is harder than comprehension.

The uncomfortable implication

If the quality delta between a model that costs four cents per call and one that costs zero is less than one point on a ten-point scale, the expensive model is not earning its keep on routine tasks. It is paying for peace of mind, not measurably better output.

This does not mean expensive models are useless. The gap widens on creative tasks, ambiguous reasoning, and edge cases that require broad world knowledge. The glossary experiment showed that context injection closes part of the gap, but not all of it. Some tasks genuinely need the bigger model.

But for the plumbing — the classification, extraction, summarization, and formatting that constitutes ninety percent of API calls in a real application — the fallback is not a compromise. It is the same thing, for less money, often faster.

The revised chain

The experiment led to a concrete change in how the applications are wired:

The cloud primary handles user-facing work where quality differences are perceptible. If the cloud goes down, a free-tier model takes over — not as a degraded experience, but as a lateral move that the user probably will not notice. If everything external is down, the laptop models handle it. Slower, but measured at 8.2 to 8.4 on the same tasks.

Three tiers. Seamless failover. Total quality loss in the worst case: less than one point.

What the judge missed

One caveat worth flagging. The AI judge that scored all this was generous. Everything clustered between 7 and 9. A human evaluator would probably be harsher on the local models’ prose style — sentence structure, word choice, the subtle differences that distinguish “competent” from “polished.” The judge caught factual errors and structural problems but went easy on elegance.

This means the measured 0.8-point gap is probably an undercount for tasks where style matters (creative writing, marketing copy, user communication) and approximately correct for tasks where it does not (extraction, classification, summarization, formatting).

Know which kind of task you are building for, and measure accordingly.

Since this experiment: the free cloud model developed reliability issues and was replaced. The measurement methodology held up. The finding — that fallback quality loss is smaller than assumed — has been confirmed across three subsequent projects.