experiment 2026-02-23

Five Judges for Six Cents

How to build an AI evaluation panel that catches its own biases

AI Collaboration

The first time I asked an AI to judge another AI’s work, it gave everything an 8 out of 10. The second attempt, with a different model, also produced 8s. Two judges, twenty evaluations, zero useful signal. Every output was apparently the same quality. They were not.

This is the rubber-stamp problem: AI judges default to “pretty good” the way restaurant reviews default to four stars. The score looks like an evaluation. It is actually a reflex.

The fix cost six cents.

The problem with one judge

A single AI evaluator has three failure modes, and they compound.

Score compression. Everything clusters between 7 and 10 on a 10-point scale. A mediocre output gets 7.5. A genuinely excellent output gets 8.5. The difference is invisible in the noise. You cannot rank, you cannot compare, you cannot make decisions.

Invisible bias. Every model has preferences baked into its training. Some favor concise output. Some favor verbose. Some go easy on outputs from their own provider family. With one judge, you cannot detect any of this. The bias is the evaluation.

No disagreement signal. When two humans disagree about quality, the disagreement itself is information — it tells you the evaluation is ambiguous, or that different criteria lead to different conclusions. One judge cannot disagree with itself. You get a score and no way to know if it is stable.

Two judges help with the third problem but not the first two. You need enough judges from enough different backgrounds to generate real variance and detect systematic bias.

Five judges, five providers

The panel has five models from five different providers. This is not overkill — it is the minimum for bias detection.

One judge is from the same provider as the outputs being evaluated. If that judge consistently scores those outputs higher than the other four do, you have caught a bias. If it does not, you have evidence of fairness. Either way, you know something you did not know before.
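A minimal sketch of that check, assuming scores are already collected per judge on the same scale. The judge names and numbers are made up for illustration; the function simply compares the same-provider judge's mean against the mean of everyone else.

```python
from statistics import mean

def self_preference_gap(scores, home_judge):
    """Mean score from the same-provider judge minus the mean from all other judges.

    scores maps judge name -> list of scores, one per evaluated output,
    all on the same scale and in the same output order.
    """
    others = [s for judge, js in scores.items() if judge != home_judge for s in js]
    return mean(scores[home_judge]) - mean(others)

# Hypothetical scores for three outputs from the provider under test.
scores = {
    "judge_same_provider": [4, 5, 4],
    "judge_b": [4, 4, 3],
    "judge_c": [3, 4, 4],
    "judge_d": [4, 4, 4],
    "judge_e": [3, 5, 4],
}
gap = self_preference_gap(scores, "judge_same_provider")
print(f"self-preference gap: {gap:+.2f}")
```

A gap that stays clearly positive across many rounds is the signal to look for. A single round with three outputs, as here, is far too small to conclude anything.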

The other four span open-source models, commercial models, a Chinese lab, and a speed-optimized inference provider. Different training data, different architectures, different aesthetic preferences. When they agree, the agreement means something. When they disagree, the disagreement means something too.

Total cost per evaluation round: about six cents. For context, a single call to a good model costs two to four cents, so five judges cost about as much as two normal API calls. The marginal cost of going from one judge to five is negligible. The information gain is not.

The 1-to-5 trick

The single most important design decision: use a 5-point scale, not a 10-point scale.

On a 10-point scale, models cluster everything between 7 and 10. A score of 7 feels harsh. A score of 6 feels like a failing grade. The usable range is four points wide, which means you are asking the model to make fine distinctions in a narrow band where its confidence is lowest.

On a 5-point scale, 3 is the middle. It is not harsh, it is average. A 2 is clearly below average. A 4 is clearly above. The entire range is usable. Models that rubber-stamped everything at 7-8 on a 10-point scale produce genuine variance, from 2 through 5, on a 5-point scale. The psychology of the scale changes the output.

This is not a hypothesis. In blind testing, the same models that compressed scores to 7-10 on a 10-point scale produced scores ranging from 2 to 5 on a 5-point scale. Same outputs being evaluated. Same prompts. Different scale, dramatically different signal.
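One way to put a number on "usable range" is the fraction of a scale's points that a judge actually uses. The sketch below is an assumption about how to measure it, and the score lists are invented placeholders, not the data from the blind test.

```python
def scale_usage(scores, low, high):
    """Fraction of the scale's integer points that appear at least once."""
    used = {s for s in scores if low <= s <= high}
    return len(used) / (high - low + 1)

# Invented placeholder scores for the same eight outputs on each scale.
ten_point = [7, 8, 8, 8, 7, 8, 9, 8]
five_point = [2, 4, 3, 5, 3, 4, 2, 4]

usage_10 = scale_usage(ten_point, 1, 10)
usage_5 = scale_usage(five_point, 1, 5)
print(f"10-point scale usage: {usage_10:.0%}, 5-point scale usage: {usage_5:.0%}")
```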

What the panel catches

With five judges, you get things that one or two judges cannot provide.

Consensus strength. Five judges agreeing that output A is better than output B is qualitatively different from one judge saying so. When all five converge despite different training lineages and aesthetic preferences, the ranking is robust.

Disagreement mapping. When three judges prefer A and two prefer B, you know the comparison is close and probably depends on which criteria you weight. This is useful information. A single judge would have given you a coin flip disguised as a verdict.

Provider bias detection. Across multiple evaluation rounds, you can track whether any judge systematically favors outputs from models it is related to. In practice, we have not found strong provider self-preference — but we know we would catch it if it existed, which changes how much we trust the results.

Rubric calibration. Different judges are generous on different dimensions. One might score code quality high while another scores it low on the same output. When you see this pattern, the problem is an ambiguous rubric dimension, not the judges. Fix the rubric.
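Two of these signals are easy to compute once the scores are in hand. The sketch below assumes a simple data shape (one preferred label per judge, one score per judge per dimension); the judge names, dimension names, and numbers are hypothetical.

```python
from collections import Counter
from statistics import pstdev

def vote_split(preferences):
    """Count how many judges preferred each output label."""
    return Counter(preferences.values())

def dimension_spread(scores_by_judge):
    """Population standard deviation of the judges' scores, per rubric dimension."""
    dimensions = list(next(iter(scores_by_judge.values())))
    return {
        dim: pstdev([judge_scores[dim] for judge_scores in scores_by_judge.values()])
        for dim in dimensions
    }

# Hypothetical panel results for one A-versus-B comparison.
preferences = {"judge_a": "A", "judge_b": "A", "judge_c": "B", "judge_d": "A", "judge_e": "B"}
scores_by_judge = {
    "judge_a": {"correctness": 4, "code_quality": 5},
    "judge_b": {"correctness": 4, "code_quality": 2},
    "judge_c": {"correctness": 4, "code_quality": 4},
    "judge_d": {"correctness": 3, "code_quality": 2},
    "judge_e": {"correctness": 4, "code_quality": 5},
}

split = vote_split(preferences)
spread = dimension_spread(scores_by_judge)
print(dict(split))
print({dim: round(sd, 2) for dim, sd in spread.items()})
```

A 3-2 split flags a close comparison. A dimension whose spread is much larger than the others is the candidate for a rubric rewrite.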

The process

Send each judge the same outputs with the same rubric. Force JSON responses with per-dimension scores and justifications. Aggregate. Look for outliers. If a judge consistently fails to produce valid JSON or compresses every score to the same value, replace it.
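A sketch of the validation step, using only the standard library. The reply shape shown in the docstring is an assumption made for this example; the original setup may use a different JSON layout.

```python
import json

SCALE = range(1, 6)  # valid scores are the integers 1 through 5

def parse_judge_reply(raw, dimensions):
    """Return {dimension: score}, or None if the reply is unusable.

    Assumed (hypothetical) reply shape:
    {"scores": {"<dimension>": {"score": 3, "justification": "..."}}}
    """
    try:
        payload = json.loads(raw)
        scores = {dim: payload["scores"][dim]["score"] for dim in dimensions}
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    if not all(type(s) is int and s in SCALE for s in scores.values()):
        return None
    return scores

def is_rubber_stamp(replies):
    """True if a judge gave the identical score everywhere across its replies."""
    flat = [s for reply in replies for s in reply.values()]
    return len(flat) > 1 and len(set(flat)) == 1

good = '{"scores": {"clarity": {"score": 4, "justification": "Reads cleanly."}}}'
print(parse_judge_reply(good, ["clarity"]))
print(parse_judge_reply("Sure! Here is my evaluation...", ["clarity"]))
```

A judge whose replies keep coming back as None, or for whom is_rubber_stamp stays true across rounds, is the one to swap out.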

The rubric changes per evaluation type. Comparing code gets different dimensions than comparing writing. Thirty seconds of rubric design pays for itself immediately — a coding rubric applied to a writing task produces garbage scores regardless of how many judges you use.
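Keeping rubrics in one table makes it hard to apply the wrong one by accident. This is a sketch under assumptions: the dimension names and prompt wording are illustrative, not the rubrics used in the experiment.

```python
# Hypothetical rubric dimensions per evaluation type.
RUBRICS = {
    "code": ["correctness", "readability", "error_handling"],
    "writing": ["clarity", "structure", "voice"],
}

def build_judge_prompt(task_type, outputs):
    """Assemble the single prompt that every judge receives unchanged."""
    dimensions = RUBRICS[task_type]  # an unknown task type raises KeyError on purpose
    lines = [
        "Score each output from 1 to 5 on every dimension; 3 means average.",
        "Dimensions: " + ", ".join(dimensions),
        "Reply with JSON only, giving a score and a justification per dimension.",
        "",
    ]
    for label, text in outputs.items():
        lines += [f"Output {label}:", text, ""]
    return "\n".join(lines)

prompt = build_judge_prompt("writing", {"A": "first draft", "B": "second draft"})
print(prompt)
```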

Run it three times and average for important decisions. That is eighteen cents for a statistically meaningful evaluation with bias detection and disagreement mapping. I have spent more than that on coffee that was worse than the backup coffee.
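Averaging across rounds is a few lines. The sketch assumes each round yields one score per judge per output; the numbers are invented, and the six-cent figure is the per-round cost quoted above.

```python
from statistics import mean

ROUND_COST_CENTS = 6  # approximate per-round cost quoted in this post

def average_rounds(rounds):
    """Average each output's panel mean across repeated rounds.

    rounds is a list of {output_label: [one score per judge]} dicts.
    """
    labels = list(rounds[0])
    return {label: mean([mean(r[label]) for r in rounds]) for label in labels}

# Invented scores from five judges over three rounds.
rounds = [
    {"A": [4, 4, 3, 5, 4], "B": [3, 3, 4, 3, 2]},
    {"A": [4, 5, 4, 4, 3], "B": [3, 2, 3, 4, 3]},
    {"A": [5, 4, 4, 4, 3], "B": [2, 3, 3, 3, 4]},
]

averaged = average_rounds(rounds)
total_cost_cents = len(rounds) * ROUND_COST_CENTS
print(averaged, f"cost: {total_cost_cents} cents")
```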

Since this experiment: two of the original five panel members hit provider issues (expired free credits, reliability problems). The methodology survived — swapping in replacement models from other providers took minutes. The architecture is designed for this: judges are interchangeable as long as the panel stays diverse.