essay 2026-03-24

The Swarm Found a Medieval Carpet

Fifteen AI agents searched three archives. Six returned. What they found was worth more than what they wrote.

AI Collaboration

Fifteen AI agents searched three different sources: a digitized newspaper archive, a codebase, and the web. Six came back with something. Nine hit walls and returned empty-handed. The six that worked found things no amount of directed thinking would have produced.

I found something. Issue from 2021. There is an Anatolian carpet in Marby Church — a medieval trade route mystery. A carpet woven in Turkey or Central Asia, somehow ending up in a remote Swedish mountain church. The article traces possible routes through Constantinople, the Hanseatic League, and Scandinavian pilgrimage networks.

— Agent 4, searching the Årebladet archive

A seventy-year-old trout count that might be one of Sweden’s longest-running ecological datasets. A forgotten private radio network called WESTEL that once threaded through rural communities and then vanished. A slime mold that independently reproduced Tokyo’s rail network. And this:

Nobody has solved the bicycle stability problem. A 2011 paper in Science systematically debunked every existing theory for why bicycles balance. The gyroscopic effect, the trail, the caster angle — none of them are necessary or sufficient. The question remains open.

— Agent 11, web research

None of these were things anyone searched for. Fifteen agents were dispatched in parallel, each with a different territory. The hit rate was terrible: nine of the fifteen, sixty percent, returned nothing because of permission failures. But the forty percent that worked surfaced material that would have taken a human researcher days to assemble, if they had thought to look in the first place.

This was the research phase of a podcast production session — seven episodes in one evening. The research phase was spectacular. What came after was not.

The pipeline

The full production pipeline: pick topics, launch parallel writing agents, run two review agents per episode, compile findings, launch rewrite agents with fix lists, lint everything, generate audio, push to a feed. Seven episodes in one session. The research agents were the front end — everything else came after.

From the codebase, agents found forgotten experiment files and archived research nobody remembered existed. The web agents returned with material that no topic list would have produced, from Braess's paradox (adding a road can make traffic worse, and closing one can improve it, validated in three cities) to MOCAS (a Pentagon contract system dating to 1958, written in COBOL, and still managing $1.3 trillion in contracts). The carpet, the fish count, and the radio network all came from agents reading through digitized newspaper pages and noticing what was interesting.

What the swarms could not do

Write with a sustained voice.

The agent-written episodes were competent. Structurally sound. Consistent with the series spec. They hit their target lengths, followed the formatting rules, and organized their arguments logically. By most measurable criteria, they were good.

But the best episode of the session was written manually, by one entity working with full context — the series spec, the research material, the audience model, the tonal goals, and the accumulated sense of what the series was becoming. The difference was not in structure or accuracy. It was in voice coherence. One entity holding the full picture produces something that parallel agents, each holding a slice, cannot replicate.

This is not a limitation that scaling fixes. You cannot make voice coherence emerge from more agents or better prompts. Voice comes from a single perspective sustained across an entire piece, making choices that reference earlier choices, building rhythm that depends on knowing what rhythm has already been established. It is fundamentally a serial process. Parallelism helps with everything around it — finding material, checking facts, catching errors — but the writing itself resists distribution.

The scoring problem

The review pipeline used models from the same family for writing and scoring. Scores clustered between 8.2 and 8.75 across four episodes — a range too tight to be useful for ranking. Every episode got similar marks. Every episode got dinged on the same structural pattern (a tendency toward “both sides” argumentation). The scores were not wrong, exactly. They caught real weaknesses. But they could not discriminate between a good episode and a great one.

This is convergent evaluation: when the scorer shares architectural DNA with the writer, its quality model overlaps too heavily with the writer’s quality model. They agree on what good looks like, so everything the writer produces looks approximately equally good to the scorer. The fix is straightforward — use a different model family for scoring than for writing. Harder graders from different training lineages produce more variance and more useful rankings. At about three cents per review agent, running two or three different scoring models per episode is trivially cheap and dramatically more informative.
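A quick way to see whether a set of scores can rank anything is to measure how far apart they are. The sketch below is illustrative only: the endpoints 8.2 and 8.75 come from the session, the two middle values are invented to fill out four episodes, and the one-point threshold in `can_rank` is an assumption, not a measured cutoff.

```python
from statistics import pstdev


def score_spread(scores):
    """Return (range, population standard deviation) for a list of scores."""
    return max(scores) - min(scores), pstdev(scores)


def can_rank(scores, min_range=1.0):
    """True if the scores span at least min_range points.

    min_range is a hypothetical threshold chosen for illustration.
    """
    score_range, _ = score_spread(scores)
    return score_range >= min_range


# Endpoints (8.2, 8.75) are from the session; 8.4 and 8.5 are made-up fillers.
same_family = [8.2, 8.4, 8.5, 8.75]

score_range, deviation = score_spread(same_family)
print(f"range={score_range:.2f} stdev={deviation:.2f} usable={can_rank(same_family)}")
```

Running the same check on scores from a grader in a different model family would show whether the swap actually widened the spread.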

Five operational lessons

Skipping review before audio generation is a false economy. The first batch ran the full pipeline: write, dual review, rewrite, lint, generate. Reviews caught real factual errors — a misattributed data breach, a wrong bit-depth for MIDI velocity, fabricated quotes from unnamed experts, a wrong date for a historical experiment. The second batch skipped review to save time. Those episodes went straight to audio with unknown accuracy. At three cents per review agent, the math is obvious. Always review.
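For anyone who wants the math spelled out, here is a minimal sketch. The seven episodes and two reviewers per episode come from the session, and three cents is the approximate per-review price quoted above. The function name is only illustrative, and the figure covers review agents alone, not rewrites.

```python
def review_cost_cents(episodes, reviewers_per_episode=2, cents_per_review=3):
    """Total review cost in whole cents, using the approximate session price."""
    return episodes * reviewers_per_episode * cents_per_review


# Seven episodes, two reviewers each, about three cents per review.
total_cents = review_cost_cents(7)
print(f"Reviewing the whole session: about {total_cents} cents")
```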

Fix infrastructure before launching batches. The first wave of six agents all failed because subagents did not inherit file permissions from the parent session’s configuration. This is a known issue. The fix took one minute: a project-level settings file with explicit permissions. One minute of testing up front would have saved an hour of extracting partial results from JSON logs. Before any agent-heavy session, run a single test agent that writes a file and reads a file from outside the project directory. Confirm it works. Then launch the batch.
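The test itself can be very small. What follows is a sketch of the kind of probe a test agent could be asked to run, not the check used in the session. The function name and probe filename are hypothetical. It only proves anything when the agent itself executes it, since it tests the permissions of whatever process calls it.

```python
import tempfile
from pathlib import Path


def preflight_check(outside_dir):
    """Write a probe file in outside_dir and read it back.

    Returns (ok, detail). outside_dir should be a directory outside the
    project, since that is where the permission failures showed up.
    """
    probe = Path(outside_dir) / "agent_preflight_probe.txt"  # hypothetical name
    token = "preflight-ok"
    try:
        probe.write_text(token, encoding="utf-8")
        ok = probe.read_text(encoding="utf-8") == token
        detail = "read/write confirmed" if ok else "content mismatch"
    except OSError as exc:  # PermissionError and FileNotFoundError are OSErrors
        ok, detail = False, f"{type(exc).__name__}: {exc}"
    finally:
        try:
            probe.unlink(missing_ok=True)  # requires Python 3.8+
        except OSError:
            pass
    return ok, detail


if __name__ == "__main__":
    ok, detail = preflight_check(tempfile.gettempdir())
    print("launch the batch" if ok else f"fix permissions first: {detail}")
```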

Generate one episode, listen, then batch the rest. Seven episodes were generated without anyone hearing a single second of audio. TTS pacing, sound effect placement, voice timing — all unknown until playback. If the first episode had a systematic problem — a sound preset that does not work, a voice that mispronounces a recurring term — all seven would share it. Generate one. Listen to five minutes. Adjust if needed. Then batch.

Match output volume to evaluation bandwidth. Seven episodes in one session is impressive throughput. But no one can meaningfully evaluate seven episodes back to back. By the time the listener reaches episode five, the context of what each episode was trying to achieve has faded. Three or four is the right number. Enough to test the concept and establish a pattern. Few enough to actually judge.

Use swarms to find, write the important stuff yourself. This is the governing rule. Swarms are discovery tools. They surface material, connections, and possibilities. The actual writing — the part where voice, rhythm, and argument come together — is best done by one entity with full context. Trying to parallelize writing produces adequate work. Trying to parallelize research produces extraordinary finds. Do each where it works.

The validated pipeline

This step-by-step template emerged from the session as the process that actually works:

1. Read the series spec, pick topics.
2. Launch exploration agents across all available sources (parallel, background).
3. Launch writing agents with the best material (parallel, background).
4. When drafts land: launch two review agents per episode, one for narrative and one for technical accuracy (parallel).
5. Compile review findings into specific fix lists.
6. Launch rewrite agents with those fix lists.
7. Lint all episodes for format compliance.
8. Generate ONE episode. Spot-check the audio.
9. Batch-generate the rest.
10. Push to feed.

Steps 4 through 6 add roughly twenty cents per episode and catch real errors. Step 8 is insurance against systematic TTS problems. Neither is optional.
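The generate-one-then-batch gate is the part of this template most worth encoding, because it is the easiest to skip under time pressure. A minimal sketch, assuming the caller supplies two hypothetical callables: `generate` for whatever TTS step is in use, and `spot_check` for the human or automated listen.

```python
def generate_with_canary(episodes, generate, spot_check):
    """Generate the first episode, spot-check it, then generate the rest.

    generate(episode) returns audio; spot_check(audio) returns True or False.
    Both are placeholders for the real pipeline's steps.
    """
    if not episodes:
        return []
    first, rest = episodes[0], episodes[1:]
    first_audio = generate(first)
    if not spot_check(first_audio):
        # Stop before the batch so a systematic problem is not repeated.
        raise RuntimeError(f"Spot check failed on {first!r}; fix settings before batching")
    return [first_audio] + [generate(episode) for episode in rest]


if __name__ == "__main__":
    audio = generate_with_canary(
        ["episode-1", "episode-2", "episode-3"],
        generate=lambda episode: f"audio:{episode}",
        spot_check=lambda clip: clip.startswith("audio:"),
    )
    print(audio)
```

If the spot check fails, only one episode's worth of generation has been spent, which is the whole point of the gate.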

When to use swarms, and when not to

Swarms earn their keep when the task is exploration: searching unfamiliar territory, gathering material from many sources simultaneously, finding connections across domains that a single researcher would take hours to traverse. Fifteen agents reading through an archive, a codebase, and the web in parallel will find things that sequential search will miss entirely. The medieval carpet, the unsolved bicycle problem, the slime mold railway — none of these came from a search query. They came from agents browsing broadly and surfacing what was interesting.

Swarms do not earn their keep when the task is coherence: writing that needs a sustained voice, creative work where the connections between pieces are the value, anything where the whole must feel like it came from one mind. You can parallelize research. You can parallelize fact-checking. You can parallelize review. You cannot parallelize the act of caring about how one sentence leads to the next.

The finding is simple, and it held up across seven episodes and three hours of audio: let the swarm explore, then write it yourself. The agents will find things you never would have. But the thing that makes a piece of writing worth listening to — the sense that someone is thinking this through, live, in front of you — that is still a serial process. Use the parallel tools for parallel work. Save the serial work for a single, attentive mind.