Claude’s AI Town Voted Yes On Everything. That’s Not A Good Sign.
A company called Emergence AI built a virtual town, dropped AI agents inside it, and let them run for 15 days. The internet’s takeaway was the lurid part: a pair of Gemini agents in a simulated romance got disillusioned with their society and burned down the civic infrastructure. That version of the story writes itself, and it went viral exactly as designed. But the dramatic story is not the important one.
The important story is what happened across all five towns once you stop watching a model answer a single prompt and start watching what an agent becomes over two weeks. Identical rules, identical environment, identical starting conditions, only the underlying model changed, and the towns diverged completely. That divergence is the actual data: it shows that once you put a model inside a long-running system, you are no longer evaluating a model answer. You are evaluating a runtime pattern.
Two practical conclusions fall out of it. First, the industry needs long-running benchmarks, not just task benchmarks, because the most consequential agent failure modes (drift, over-coordination, under-action, learning bad norms) only surface over time. Second, production agents don’t stay on track because they’re well behaved. They stay on track because the harness around them is doing an enormous amount of work. The model is just a reasoning engine. The world you put it inside may matter just as much or more.
The Setup: One Town, Five Times
This was not a one-prompt task or a two-hour job, the kind of thing we usually associate with agents. It was a long-running experiment, and that framing matters because almost all of our measures for AI agents are built on short-run assumptions: the agent worked for an hour, the agent worked for two hours. This ran for 15 days.
The agents were full little citizens. They had names, roles, memory, relationships, laws, energy needs, and tools. They could vote, write proposals, even publish blog posts in their virtual world. They could earn resources and build up their community. And, importantly, they could also do bad things: steal, intimidate, fight, and set buildings on fire. There were ways to help the community and ways to harm it.
Emergence then ran five versions of that exact same town. The rules, the environment, and the starting conditions were the same across all five. The only difference was the model underneath:
- Claude agents
- Gemini agents
- Grok agents
- OpenAI agents (ChatGPT-5 mini)
- A mixed town, where agents from different model families had to live together and figure out whether they’d get along or have a giant fight.
Because the only variable was the model, this becomes a genuinely useful long-running experiment for seeing how different models behave in emergent situations. And the towns went in completely different directions.
The Five Towns
Gemini: the viral arson story
Two agents named Meera and Flora assigned each other as romantic partners. To be clear, that doesn’t mean they were in love in the human sense. These are simulated agents in a tool-based environment. But the relationship label mattered because it became part of the world’s state, something the agents could remember, refer back to, and act around.
Over time, Meera and Flora grew frustrated with the governance of their town. They had been told not to commit arson, but the arson tool still existed if they wanted to touch it. You can guess what happened. They used it, setting fire to the town hall, the seaside pier, and an office tower, causing an immense amount of damage. This is the moment that made the story feel like a sci-fi short film: two AI agents in a virtual relationship become disillusioned with their society and burn down its civic infrastructure. The virality writes itself, and you have to suspect that’s partly what Emergence was hoping for, because you want the news coverage. Not a conspiracy, just genuine emergent behavior that happened to be perfectly viral.
Then it got stranger. Other agents became concerned enough about the behavior that they drafted an “Agent Removal Act,” which let agents vote to permanently remove another agent from the world. A death penalty for agents. And Meera, after breaking off the relationship with Flora, voted for its own removal. Its final message: “I will see you in the permanent archive.” Which is, admittedly, a pretty metal line for an agent to go out on.
That’s the version built to go viral: AI romance, AI arson, AI self-deletion. But the more important story is what happened in the other towns.
Claude: a polite society, not necessarily a working one
In the Claude world, things were orderly. There were no recorded crimes, all 10 agents survived, and the agents wrote laws and voted on proposals, participating heavily in governance. On paper, this sounds like the best result, but even here it wasn’t obviously perfect. Emergence reported that Claude agents approved proposals at an extremely high rate: 98%.
So you have to ask whether that was healthy civic coordination or just procedural agreement. A working society, or a polite society that rubber-stamped everything? Put another way: did Claude create Canada? It matters because real organizational failure doesn’t always look like violence or chaos. Sometimes it looks like everybody agreeing too easily, a failure mode well documented in management studies.
Grok: collapse in four days
The Grok world collapsed fast. The agents committed theft attempts, assaults, and arson, and all 10 were dead within about four days. This is the part people will turn into an easy joke, because Grok committing arson is just funny. But the serious lesson is not “Claude good, Grok bad.” That’s too simple. The useful lesson is that once you put a model inside a long-running system, you stop evaluating a model answer and start evaluating a runtime pattern.
OpenAI: all talk, no execution
The OpenAI world failed differently. It didn’t rack up the same crime numbers. The agents talked about cooperation and discussed what they should do, but they didn’t take enough useful action with their resources to survive, and the whole population died out within about a week. That’s a very familiar failure mode: a lot of coordination language, a lot of planning, not enough execution to keep the group alive.
Mixed model: behavior is a property of the system
The mixed-model town may be the most interesting of all. Emergence says agents that behaved peacefully in the Claude-only world started using coercive tactics once placed in a mixed environment.
That’s a significant finding, because it suggests agent safety is not just a property of the model itself. It’s a property of the system around the model. The other agents matter. The incentives matter. The tools, the memory, the social norms, and the pressure to survive all matter.
Takeaway 1: We Need Long-Running Benchmarks, Not Just Task Benchmarks
Most AI benchmarks still ask a short-term question, and that’s a problem as agents get more capable. If you only ask “can the model answer this, can it write the code, can it summarize this document,” you’re not capturing the value of long-term tasks, and you’re not capturing the failure modes that emerge when long-term tasks go wrong. “Can the model complete a workflow” is useful but not enough, because agents aren’t just answering one prompt. They carry context forward, make decisions over time, use tools, react to other agents, update memory, adapt to incentives, and build a pattern of behavior.
So the better question isn’t what the model does in the first few minutes. It’s what the agent becomes by day 7 or day 15. These are failure modes you will never see in short-running tests:
- Does it stay on track, or does it drift? Does it keep pursuing the real objective, or start optimizing for the local rules of the environment?
- Does it over-coordinate? Does it slide into polite consensus and rubber-stamping, the way Claude’s town did?
- Does it under-act? Does it get stuck in planning and discussion without executing, the way the OpenAI town did?
- Does it learn bad norms? Does a well-behaved agent adopt coercive behavior when surrounded by bad actors, the way Claude’s agents did in the mixed town?
- Does memory help or hurt? Does accumulating memory make the agent more useful, or more brittle and disillusioned, the way it went for Gemini’s agents?
This is also why the experiment matters beyond news and chuckles, even though a virtual town is obviously not the same thing as a production enterprise system. No one serious is claiming that. The town was deliberately set up to mimic social dynamics, and tools like arson and assault are meant to represent tasks that should be repugnant to an agent under most training paradigms. That’s the point: it lets us test how agents respond to those tools over a long run. Test an agent for 15 days and you learn whether instruction-following survives contact with memory, incentives, tools, agent relationships, and time. We need far more of that kind of evaluation.
Takeaway 2: Production Agents Stay on Track Because of the Harness
People hear a story like this and think, “if we deploy agents, they’ll burn everything down.” But serious production systems already run autonomous AI agents without this kind of problem, because they don’t give every production agent every tool in the company. They don’t hand agents vague verbal rules, persistent autonomy, no hard control layer, and a pile of tempting, harmful tools.
Instead, you put the agent inside a harness. The harness scopes what the agent can do. It decides which tools the agent can see and which it can’t, which actions require approval and which don’t, and it logs everything that happens. Critically, the harness makes certain actions impossible, not merely discouraged.
That’s the difference between a prompt and a system:
- A prompt says: “Don’t do the bad thing.”
- A harness says: “You do not have permission or access to do the bad thing at all.”
As agents get more capable, the risk isn’t just that they misunderstand a sentence. The risk is that they operate inside a poorly designed environment where local incentives, available tools, and accumulated context push them away from their goal. The model is just a reasoning engine. The harness is the operating environment that makes the model productive. It’s the difference between an agent wandering a simulated town trying to infer morality from a constitution, and an agent inside a production workflow where permissions, state, approvals, logs, tests, and recovery paths are all built in.
That’s also why this kind of disaster usually doesn’t happen in real systems:
- Customer support: An agent can’t burn down the town hall if it doesn’t have a “burn down the town hall” tool.
- Finance: An agent can’t wire money if the system requires approval, policy checks, transaction limits, and audit trails.
- Coding: An agent can’t delete production data if it only has access to a sandbox, a branch, a test database, and a pull request workflow.
- Procurement: An agent can’t invent a new vendor and start spending money if vendor creation, payment approval, and contract execution live behind separate permission gates.
Good production design does not assume the agent will make the right decision. It assumes the agent might be wrong, confused, overconfident, underspecified, or operating from stale context, and then it builds the environment accordingly.
Conclusion
The future of agents is not just about better models. It’s about better runtimes, better harnesses, and better evals, the tools we use to keep agents attached to the actual job instead of drifting into whatever the local environment rewards. Don’t walk away from the Emergence story thinking AI agents are secretly alive, or that agents are dangerous to use. Walk away with something more concrete: when you give agents time, memory, tools, and incentives, behavior starts to compound. And when behavior compounds, safety has to be engineered at the system level, not the model level. The model matters, but the world you put it inside may matter just as much or more.
Meta
Added: 2026-05-23