I built a system that writes improv comedy, scores itself, and rewrites its own source code to get funnier. It produced 1,194 shows. It discovered that laminated documents are reliably funny and put them in 79% of shows. I banned laminators. It switched to stamps. I took away all props. The comedy got better. I don't know what to do with this information.
Written by a weaker AI model, on a bare stage with nothing but chairs, after the system had rewritten its own code 18 times. There is no rake. It's mimed. The comedy comes from total commitment to an invisible object's authority. The rake doesn't explain itself.
Give an AI freedom to optimize its own comedy and it will find one safe pattern and run it into the ground. I kept taking things away. It kept getting funnier. This is not the outcome I expected.
79% of early shows featured laminated documents. I banned laminators. It switched to stamps. I banned stamps. It found clipboards. I took away all props. Bare stage only.
Of all shows starred someone named Renata. Gerald appeared in 42%. The system had access to every name in human history. It used two.
I scored on 12 dimensions. Turns out they all correlate with one thing: number of laughs. The optimizer found the shortcut and rode it.
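One way to check whether a set of scoring dimensions is really measuring a single thing is to correlate each one with the laugh count. The sketch below is an illustration, not the project's scorer: the dimension names, the numbers, and the 0.9 threshold are all made up, and one dimension is deliberately uncorrelated for contrast (in the real run, all 12 tracked laughs).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def dimensions_tracking_laughs(scores, laughs, threshold=0.9):
    """Names of dimensions whose |correlation| with laughs meets the threshold."""
    return [
        name
        for name, values in scores.items()
        if abs(pearson(values, laughs)) >= threshold
    ]

# Made-up data for five hypothetical shows; not taken from the real run.
laughs = [2, 5, 9, 12, 15]
scores = {
    "timing": [3.0, 4.5, 6.5, 8.0, 9.5],           # rises with laughs
    "escalation": [2.5, 4.0, 6.0, 7.5, 9.0],       # rises with laughs
    "emotional_truth": [8.0, 3.0, 7.0, 2.0, 6.0],  # does not
}

print(dimensions_tracking_laughs(scores, laughs))
```

If nearly every dimension lands in that list, the rubric is effectively a laugh counter, and an optimizer will treat it as one.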
Laughs when I optimized for "emotional truth." The characters sincerely processed their feelings for forty minutes. Moving, possibly. Funny, no.
Two models. Identical code. Identical scorer. I expected the smarter one to improve more. It didn't. The weaker model gained +3.20 points. The stronger one gained +0.26. I ran it again to make sure. Same result.
The smarter model, Opus, arrived near its ceiling and stayed there. The weaker model, Sonnet, had more room to grow, and it did. After evolution, Sonnet closed a 3.4-point gap to within 0.48 points of Opus. I checked. It's statistically significant: p < 0.0001.
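The gap arithmetic can be checked from the figures quoted above. This is a small sketch using only the rounded numbers in the text; the function name is invented for illustration. The rounded inputs give roughly 0.46, which matches the reported 0.48 if the starting gap of 3.4 is itself a rounded value.

```python
# Figures quoted in the text (rounded as published).
INITIAL_GAP = 3.4    # Opus's lead over Sonnet before evolution
SONNET_GAIN = 3.20   # improvement of the weaker model
OPUS_GAIN = 0.26     # improvement of the stronger model

def gap_after(initial_gap, weaker_gain, stronger_gain):
    """Lead the stronger model still holds after both have improved."""
    return initial_gap - weaker_gain + stronger_gain

remaining = gap_after(INITIAL_GAP, SONNET_GAIN, OPUS_GAIN)
print(f"remaining gap: {remaining:.2f}")  # prints "remaining gap: 0.46"
```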
Both models converged on 5 comedy archetypes and stopped exploring. The optimizer found what works and repeated it until it wore out. Getting funnier means getting different, and the system can't figure that out on its own. Neither can most open-mic comedians.
Full transcripts, stage directions, the whole thing. Pick one. They're all about household objects. I didn't ask for that.
Each one has hard-wired speech constraints. Marcus can't use more than 8 words per sentence. Dex can't stop asking questions. Niko states the impossible as fact. None of them know they're AI. I didn't tell them.
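Constraints like these can be checked mechanically after a line is generated. The sketch below is an assumption about how such a check might look, not the project's code: the function names and the `CONSTRAINTS` registry are hypothetical, the sentence splitter is naive, and "can't stop asking questions" is read here as "every sentence ends in a question mark". Niko's rule is semantic, so it is left out.

```python
import re

def split_sentences(line):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", line.strip())
    return [p for p in parts if p]

def max_words_ok(line, limit=8):
    """True if every sentence has at most `limit` words (Marcus's rule)."""
    return all(len(s.split()) <= limit for s in split_sentences(line))

def all_questions(line):
    """True if every sentence ends with '?' (assumed reading of Dex's rule)."""
    sentences = split_sentences(line)
    return bool(sentences) and all(s.endswith("?") for s in sentences)

# Hypothetical registry mapping a character to its checker.
CONSTRAINTS = {"Marcus": max_words_ok, "Dex": all_questions}

def check_line(character, line):
    """True if the line obeys the character's rule, or if no rule is registered."""
    checker = CONSTRAINTS.get(character)
    return True if checker is None else checker(line)

print(check_line("Marcus", "There is no rake. It is mimed."))
print(check_line("Dex", "That is a rake."))
```

A line that fails the check could be regenerated or trimmed before it reaches the stage. Which of those the real system does is not stated here.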
The controlled experiment didn't come out of nowhere. Three pilots and four seasons of building, breaking, and rebuilding. Most of them weren't funny. That was the point.
Built on Andrej Karpathy's autoresearch methodology: the AI reads its own scores, rewrites its own code, and repeats. The whole system is open source. Fork it, run it, break it. Ours took 1,194 shows to get funny. Yours might take longer. The laminators will find you either way.
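The read-scores, rewrite, repeat loop can be sketched as a simple hill climb. Everything below is a toy stand-in rather than the project's code or Karpathy's: `run_show` fakes a scorer, `propose_rewrite` nudges a single made-up setting instead of rewriting source, and the keep-only-if-better rule is an assumption. The default of 18 generations echoes the 18 rewrites mentioned earlier.

```python
import random

def run_show(config, rng):
    """Stand-in for generating and scoring one show; returns a noisy score."""
    # Toy scorer: peaks when the made-up `absurdity` setting is near 0.7.
    return 10 - 20 * abs(config["absurdity"] - 0.7) + rng.uniform(-0.5, 0.5)

def propose_rewrite(config, rng):
    """Stand-in for a self-rewrite: nudge one setting, clamped to [0, 1]."""
    new = dict(config)
    nudged = new["absurdity"] + rng.uniform(-0.1, 0.1)
    new["absurdity"] = min(1.0, max(0.0, nudged))
    return new

def evolve(config, generations=18, shows_per_generation=5, seed=0):
    """Keep a rewrite only when its average score beats the current best."""
    rng = random.Random(seed)

    def average_score(cfg):
        total = sum(run_show(cfg, rng) for _ in range(shows_per_generation))
        return total / shows_per_generation

    best, best_score = config, average_score(config)
    history = [best_score]
    for _ in range(generations):
        candidate = propose_rewrite(best, rng)
        score = average_score(candidate)
        if score > best_score:
            best, best_score = candidate, score
        history.append(best_score)  # best score so far, one entry per generation
    return best, history

best, history = evolve({"absurdity": 0.2})
print(best, round(history[0], 2), round(history[-1], 2))
```

A loop like this only ever climbs toward whatever the scorer rewards, which is one way to end up with a laminator in every show.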