Artificial General Improv

I taught an AI to do improv. It became obsessed with laminators.

I built a system that writes improv comedy, scores itself, and rewrites its own source code to get funnier. It produced 1,194 shows. It discovered that laminated documents are reliably funny and put them in 79% of shows. I banned laminators. It switched to stamps. I took away all props. The comedy got better. I don't know what to do with this information.

Sonnet Evolved — #8, "Lanyard" — 83.73 / 100

BRYCE: I don't know Gerald's weight, Diane! I don't know Gerald's face! I don't know Gerald's voice! All I know is Gerald replies to my emails within four minutes and once typed ‘sounds good’ with a period instead of an exclamation point and I worried about it for a WEEK!

DIANE: Bryce. The man standing directly behind you has been wearing a name tag that says Gerald for the last forty minutes.

BRYCE: I'm not going to ASK him that, Diane, we JUST met!

Written by a weaker AI model, on a bare stage with nothing but chairs, after the system had rewritten its own code 18 times. There is no rake. It's mimed. The comedy comes from total commitment to an invisible object's authority. The rake doesn't explain itself.

Things I Had to Take Away

Give an AI freedom to optimize its own comedy and it will find one safe pattern and run it into the ground. I kept taking things away. It kept getting funnier. This is not the outcome I expected.

Laminator Dependence
79% of early shows featured laminated documents. I banned laminators. It switched to stamps. I banned stamps. It found clipboards. I took away all props. Bare stage only.

The Renata Problem
55% of all shows starred someone named Renata. Gerald appeared in 42%. The system had access to every name in human history. It used two.

The Scoring Collapse
I scored shows on 12 dimensions. It turns out they all correlate with one thing: number of laughs. The optimizer found the shortcut and rode it, and 12 dimensions collapsed into 1.
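This is the classic failure mode of a multi-dimensional rubric with one dominant latent factor. A small synthetic sketch (the data here is made up; the real per-show scores aren't published in this post) shows how 12 dimensions that all track laugh count become effectively one:

```python
import random

random.seed(1)
N, D = 200, 12

# One latent signal (laugh count) drives every rubric dimension,
# each with a little independent noise -- mirroring the collapse.
laughs = [random.gauss(0, 1) for _ in range(N)]
dims = [[l + random.gauss(0, 0.3) for l in laughs] for _ in range(D)]

def corr(x, y):
    """Pearson correlation of two equal-length score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

pairs = [corr(dims[i], dims[j]) for i in range(D) for j in range(i + 1, D)]
mean_corr = sum(pairs) / len(pairs)
print(f"mean pairwise correlation across {D} dimensions: {mean_corr:.2f}")
```

When the mean pairwise correlation is that high, an optimizer sees one knob, not twelve, and turns it.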

Therapy Improv
Zero laughs when I optimized for "emotional truth." The characters sincerely processed their feelings for forty minutes. Moving, possibly. Funny, no.

The Dumber AI Got Funnier

Two models. Identical code. Identical scorer. I expected the smarter one to improve more. It didn't. The weaker model gained +3.20 points. The stronger one gained +0.26. I ran it again to make sure. Same result.

Opus Baseline — 79.77 · 60 shows. The smart one. No self-rewriting.
Opus Evolved — 80.04 (+0.26: evolution didn't help) · 100 shows. Already near the ceiling. Stayed there.
Sonnet Baseline — 76.36 · 60 shows. The underdog. Also no self-rewriting.
Sonnet Evolved — 79.56 (+3.20: nearly caught Opus) · 100 shows. Best show in the whole run: #55 at 84.41.

The Underdog Effect

The smarter model arrived near its ceiling and stayed there. The weaker model had more room to grow, and grow it did. After evolution, Sonnet closed a 3.4-point gap to within 0.48 points of Opus. I checked: Sonnet's gain is statistically significant, p < 0.0001.
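With 60 baseline shows and 100 evolved shows, a Welch two-sample t-test is the natural check. A sketch with synthetic stand-in scores (the means and sample counts come from the table above; the ~2-point spread is my assumption, since per-show scores aren't listed here):

```python
import math
import random

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom (unequal variances)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    sa, sb = va / len(a), vb / len(b)
    t = (mb - ma) / math.sqrt(sa + sb)
    df = (sa + sb) ** 2 / (sa ** 2 / (len(a) - 1) + sb ** 2 / (len(b) - 1))
    return t, df

random.seed(0)
baseline = [random.gauss(76.36, 2.0) for _ in range(60)]   # Sonnet Baseline
evolved = [random.gauss(79.56, 2.0) for _ in range(100)]   # Sonnet Evolved
t, df = welch_t(baseline, evolved)
p = math.erfc(abs(t) / math.sqrt(2))  # two-sided, normal approx (df is large)
print(f"t = {t:.2f}, df = {df:.0f}, p = {p:.1e}")
```

A 3.2-point shift against a roughly 2-point spread is enormous; any plausible spread gives p far below 0.0001.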

Five Bits and Nothing Else

Both models converged on 5 comedy archetypes and stopped exploring. The optimizer found what works and repeated it until it wore out. Getting funnier means getting different, and the system can't figure that out on its own. Neither can most open-mic comedians.

Five Shows to Start With

Full transcripts, stage directions, the whole thing. Pick one. They're all about household objects. I didn't ask for that.

Five Performers, Zero Humans

Each one has hard-wired speech constraints. Marcus can't use more than 8 words per sentence. Dex can't stop asking questions. Niko states the impossible as fact. None of them know they're AI. I didn't tell them.
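The constraints are mechanical enough to lint. Here's a sketch of how two of them might be checked — the function names and rules-as-code are my guesses, not the repo's actual enforcement:

```python
import re

def marcus_ok(line: str) -> bool:
    """Marcus: no sentence may exceed 8 words."""
    sentences = [s for s in re.split(r"[.!?]+", line) if s.strip()]
    return all(len(s.split()) <= 8 for s in sentences)

def dex_ok(line: str) -> bool:
    """Dex: every line must end in a question."""
    return line.rstrip().endswith("?")

print(marcus_ok("I saw it. It was there."))  # True: both sentences fit
print(marcus_ok("I saw the thing that nobody else in the room noticed."))  # False: 11 words
print(dex_ok("Why is the chair looking at me?"))  # True
```

A constraint like Niko's ("states the impossible as fact") resists this kind of regex lint; that one presumably lives in the prompt, not the validator.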

874 Shows Before Anyone Asked Why

The controlled experiment didn't come out of nowhere. Three pilots and four seasons of building, breaking, and rebuilding. Most of them weren't funny. That was the point.

Pilots 1–3
The Laminator Era
Three pilot seasons taught me what doesn't work. Pilot 1 toggled config params — scores flatlined. Pilot 2 let the system rewrite its own prompts — better, but convergence appeared. Pilot 3 let it edit its own source code, and the laminator addiction emerged. Banning individual props was whack-a-mole. The system needed a structural constraint, not more bans.
314 shows · 6.3 → 7.1 / 10
Seasons 1–2
Architecture and the Comedy Mandate
Season 1 rebuilt the scoring system and produced the first real improvement. Season 2 discovered therapy improv — optimizing for emotional truth killed the comedy. The fix: telling the scorer that funny outranks meaningful.
294 shows · 76.1 → 77.3
Seasons 3–4
The Bare Stage
Season 3 took everything away. Only chairs. Every object mimed. Comedy shifted from prop gags to status and commitment — and got better. Season 4 ran longer and hit the convergence ceiling: five archetypes, recycled. Time for a controlled test.
266 shows · 78.5 → 79.2
Teach Your Own AI Troupe

Built on Andrej Karpathy's autoresearch methodology — the AI reads its own scores, rewrites its own code, and repeats. The whole system is open source. Fork it, run it, break it. Ours took 1,194 shows to get funny. Yours might take longer. The laminators will find you either way.
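Stripped of the comedy, the loop is a hill-climb over the generator's own source: generate shows, score them, propose a rewrite, keep it only if the mean score rises. A toy sketch under assumed mechanics — every name here is illustrative, not the actual repo's API:

```python
import random

def evolve(generate, score, propose_patch, generations=18, trials=5):
    """Keep a self-rewritten generator only if its mean score improves."""
    best = sum(score(generate()) for _ in range(trials)) / trials
    for _ in range(generations):
        candidate = propose_patch(generate)    # model rewrites its own code
        trial = sum(score(candidate()) for _ in range(trials)) / trials
        if trial > best:                       # hill-climb: accept improvements only
            generate, best = candidate, trial
    return generate, best

# Toy stand-ins: a "show" is just a quality number; a patch nudges it.
random.seed(42)
def make_generator(quality):
    return lambda: quality

def score(show):
    return show + random.gauss(0, 0.1)         # noisy judge

def propose_patch(gen):
    return make_generator(gen() + random.uniform(-0.2, 0.3))

_, final = evolve(make_generator(76.0), score, propose_patch)
print(f"mean score after 18 generations: {final:.2f}")
```

The asymmetry in `propose_patch` — patches help slightly more often than they hurt — is doing all the work in this toy. In the real system that asymmetry has to come from the model actually understanding its own scores.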