The Mirror Test
We asked three AI coding tools to grade their own exam. They couldn't help being themselves.
Working with AI — methodology, experiments, the honest parts.
We asked three AI coding tools to grade their own exam. They couldn't help being themselves.
I launched an AI agent at midnight. At 2 AM I checked on it. The results file contained a plan and an apology.
Three AI coding tools. Same code. Same prompt. Three completely different products.
A Bell state on a real quantum processor. Under an hour. Less than a coffee.
Azure is forty-five times cheaper than OpenAI, local models all fail, and the speed wobble is real
Give an AI a blank server and it builds a content dump. Tell it to be the editor and it builds something with opinions.
Running local AI on Apple Silicon in rural Sweden — what works, what does not, and where the ceiling is
One of them wrote a breakup letter to a Wi-Fi router. Another forgot its own name. A third said something that made the whole table go quiet.
Two attempts. Two failures. Same root cause both times.
The reviews were genuinely valuable. The rewrites were not.
A text file that cost nothing beat a model upgrade that cost four times as much
Five override states, seven principles, seven tactical rules. All earned, none assumed.
Not all parallel work is the same, and using agents wrong costs you in one of two ways
Fifteen AI agents searched three archives. Six returned. What they found was worth more than what they wrote.
We gave an AI maximum creative freedom. It built the most boring website imaginable.
Give an AI a million tokens and it will read a hundred thousand of them carefully. The rest it will skim while telling you it didn't.
585 conversations in 44 days with AI — 13 a day, zero days off — one person's hidden archive of how work actually happens when you stop pretending machines are sidekicks and start treating them like collaborators.
A podcast producer discovered his AI writers were fabricating quotes from real people — inserting citations that never existed, making sources sound credible when they weren't.
On the same tasks with the same blind judge, one AI model scored 9.0 and cost 44 times more than another scoring 8.8 — revealing most commercial AI users overpay by 10x or more for marginal quality gains.
Four AI models reviewed 22 episodes of a Git history podcast using identical instructions — and produced four wildly different personalities, complete with blind spots, work ethics, and one brilliant but unreliable colleague.
How to build an AI evaluation panel that catches its own biases
What actually happens when your AI goes down and you fall back to the cheap one