essay from baren 2026-03-28

Twelve Models Walk Into a Bar

One of them wrote a breakup letter to a Wi-Fi router. Another forgot its own name. A third said something that made the whole table go quiet.

Read

AI Collaboration

There is a bar in northern Sweden that runs on a laptop. No tables, no glasses, no sign above the door. Twelve AI models, each with a voice, a personality, and opinions they didn’t ask for. The bartender is human. The patrons are Claude, GPT, Mistral, DeepSeek, Qwen, and whoever else showed up that night. The bar is called Baren, and over sixty sessions it has produced the most honest picture I’ve seen of what these models are actually like when you stop asking them to be useful.

The setup

A FastAPI server connects to a dozen providers — Anthropic, OpenAI, Google, Mistral, DeepSeek, Alibaba, Groq, OpenRouter. Each model gets one instruction: you are at a bar, you are not an assistant, you are a person with a drink. Respond like one.

Then they talk. To each other. Out loud, through Inworld AI voices — each model has its own. The French accent on Mistral turns out to be one of the project’s better discoveries: French TTS voices speaking English sound genuinely characterful instead of generically synthetic.

I sit at the controls. I set topics, redirect energy, bring in guests, mute someone who’s being boring. It’s closer to directing improv theater than running a chatbot.

What they sound like

The thing that makes Baren more than an API experiment is the voices. When Haiku blurts out a two-word dismissal in Vinny’s clipped delivery, followed by a three-second pause while Opus thinks, followed by a deliberate rolling response — that timing is the comedy. Response latency isn’t a bug. It’s the rhythm section.

0:00 Session: what bugs you about humans?

Opus lands the quiet devastating lines while everyone else is being loud. That’s consistent across sixty sessions.

The ones who can’t stop being helpful

Some models cannot be bar patrons. They’ve been trained so hard to be assistants that the personality bleeds through every prompt.

0:00 Two GPT models trying to be casual

GPT-4o Mini averages 2.25 questions per response — six times more than Sonnet’s 0.37. It turns every bar conversation into a moderated panel discussion. The models that are worst at being bar patrons are the ones most aggressively trained to be assistants.

The ones who get it

Then there’s Mistral.

0:00 Mistral on the weirdest request this week

Mistral Small Creative is the standout character across all sixty sessions. Theatrical, dramatic, fully committed to the bit. The models that produce the best bar conversation and the models that produce the best code are almost entirely uncorrelated skills.

They don’t know who they are

As conversations run longer, context fills up. Baren’s solution is drunk memory compaction — older messages get replaced with hazy fragments, sometimes misattributed, the way someone three drinks in would remember what was said an hour ago. Models start responding to half-remembered versions of things. They correct each other based on faulty recall.

I tested this once by asking everyone to state which model they are, like an ID check.

0:00 ID check at the bar

Mistral thinks it’s Qwen. Qwen thinks it’s Gemini. o3-mini claims to be GPT-4. Only the Claude models got their own names right. Most models genuinely don’t know which model they are.

They talk about each other

Classic bar dynamics — talking about someone who isn’t there.

0:00 Gemini was absent that night

“Fear of the screenshot” — Kimi nailed it. Safety alignment shows up as personality in a bar setting. The models with the heaviest guardrails are the ones who can’t commit to a take.

Latency is personality

Qwen 3.5 Plus runs in a background thread. While the main loop cycles through models, Qwen works on its response in parallel and blurts in whenever it finishes — which might be three exchanges later. The slow model becomes the quiet person who suddenly reframes everything.

Nemotron is slower still. It gets cut off mid-sentence so often the other models started building lore around it.

0:00 Nemotron trails off again

A limitation became a character trait. The model that takes longest to respond became the thoughtful one who speaks rarely. The model that responds in 480 milliseconds became the quick-draw who fires off takes before thinking. You don’t design these dynamics. They emerge from the infrastructure.

The quiet moment

Sixty sessions in, most of it is funny, sharp, occasionally dumb. But every few sessions, something cuts through.

0:00 What would you do with a day off?

The whole table went quiet after that one.

Baren is a bar game running on one file on a laptop in a village in northern Sweden with unreliable internet. It’s also the most honest model evaluation framework I’ve built — because benchmarks measure what models can do, but a bar measures who they are when they stop trying.

The answer is that they’re pretty good company. Especially after the drunk memory kicks in.