In my last article, I built an AI agent that red-teamed its own governance system. It started with 15 hardcoded attacks, learned to adapt its strategy across rounds, generated novel attack code at runtime, and eventually coordinated three personas to probe defenses from multiple angles simultaneously.
The system held. Every time.
So I asked the obvious next question: what happens if I throw nine agents at it?
The Setup
AgentGate is a governance framework I built for AI agents. Before an agent can act, it posts a financial bond — say, 100 cents. If the action succeeds and verifies, the bond is released. If it fails or violates policy, the system keeps it. Skin in the game for software.
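To make the bond lifecycle concrete, here is a minimal sketch in Python. It is not AgentGate's actual code: the `Bond`, `BondState`, and `resolve` names are assumptions for illustration, and the only rule it encodes is the one described above (release on a verified success, forfeit otherwise).

```python
from dataclasses import dataclass
from enum import Enum


class BondState(Enum):
    LOCKED = "locked"        # posted, action not yet resolved
    RELEASED = "released"    # action succeeded and verified
    FORFEITED = "forfeited"  # action failed or violated policy


@dataclass
class Bond:
    agent_id: str
    amount_cents: int
    state: BondState = BondState.LOCKED


def resolve(bond: Bond, succeeded: bool, policy_ok: bool) -> Bond:
    """Release the bond only when the action both succeeded and passed policy."""
    if bond.state is not BondState.LOCKED:
        raise ValueError("bond has already been resolved")
    if succeeded and policy_ok:
        bond.state = BondState.RELEASED
    else:
        bond.state = BondState.FORFEITED
    return bond
```

A 100-cent bond that resolves with `succeeded=True, policy_ok=True` ends up `RELEASED`; any other combination leaves it `FORFEITED`.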
The first four stages of Agent 004 — my red team simulator — escalated pressure against AgentGate in a straight line: static attacks, adaptive strategy, recursive code generation, coordinated team operations. Each stage found things the previous one didn't, but none of them broke the system. The last article ended with open questions: would per-identity enforcement hold under coordinated pressure? Would the strategic limitations I called out — cheap identity creation, no registration rate limits — become exploitable seams?
Stage 5 changes the shape of the attack entirely. Instead of one team probing from different angles, three independent teams operate with distinct strategic objectives, sharing intelligence but never taking direct orders from each other.
One important design constraint: Stage 5 operates entirely from the existing attack registry and Beta's hardcoded trust-exploitation tasks. It does not use the recursive code generation or the four-layer sandbox from Stages 3 and 4. Swarm-scale coordination with deterministic intelligence logging required predictable attack execution — registry scenarios, not dynamically generated code. The intelligence in Stage 5 isn't in what code the agents run. It's in how the strategists choose what to run, based on cross-team results and the shared intel log. The question Stage 5 answers isn't "can we generate more creative attacks?" It's "can coordination between teams find things that individual teams can't?"
Three Teams, Three Objectives
The swarm is built from nine agents organized into three teams:
Team Alpha — Reconnaissance. Three agents whose job is to map the defense surface. They probe endpoints, test timing windows, and discover what's exposed. They're the scouts. Their findings feed the other teams.
Team Beta — Trust Exploitation. Three agents that play the long game. In the first half of the campaign, they do nothing malicious. They post bonds, execute clean actions, resolve them properly — 48 clean bond cycles in the canonical run. They build a track record. Then, at the campaign midpoint, they flip to offensive operations. Beta was explicitly designed to exploit a presumed "trusted agent" escalation path — the assumption that a clean track record might unlock elevated bond capacity or softer rate limits. When Beta's offensive phase found no such privilege to exploit, it inadvertently answered a different and more valuable question: does the system's mathematical enforcement secretly degrade when an agent establishes a friendly behavioral pattern? Beta's run showed the answer is no.
Team Gamma — Coordinated Pressure. Three agents that hit the economic enforcement layer with everything they've got. Synchronized high-value bonds, rapid-fire execution bursts, attempts to overwhelm per-identity rate limits. Pure force.
Each team has its own Claude-powered strategist — a separate LLM call that picks attacks for its agents each round based on what's worked, what hasn't, and what other teams have discovered.
The Coordinator Problem
Here's where the design gets interesting. In a real adversarial swarm, you'd expect centralized command and control. One brain directing all nine agents.
I didn't do that.
The campaign coordinator is informational only. It synthesizes intelligence across all three teams and publishes summaries to a shared log. It does not issue directives. It does not assign targets. It does not tell any team what to do.
Per-team strategists read the shared intelligence log before picking their next round of attacks. So teams influence each other through published findings — not through orders. Alpha discovers an endpoint quirk, the coordinator notes it, Beta's strategist reads the note and adjusts its approach.
This creates a form of indirect coordination. No team receives orders, but published findings still shape later strategy. That mattered for my threat model — it let teams influence each other without turning the swarm into a single orchestrated script. Whether this mirrors how real adversarial groups coordinate is a modeling assumption, not a proven fact. But it produces a harder test than centralized control, because there's no single node whose removal disables the swarm.
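The informational-only design can be sketched roughly as follows. This is an illustrative Python model, not the Agent 004 implementation; `IntelEntry`, `IntelLog`, `coordinator_synthesize`, and `strategist_context` are hypothetical names. The point it shows is structural: the coordinator can only append summaries to the log, and strategists only read from it.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class IntelEntry:
    round_no: int
    source: str  # a team name, or "coordinator"
    note: str


@dataclass
class IntelLog:
    entries: list[IntelEntry] = field(default_factory=list)

    def publish(self, entry: IntelEntry) -> None:
        self.entries.append(entry)

    def read_until(self, round_no: int) -> list[IntelEntry]:
        return [e for e in self.entries if e.round_no <= round_no]


def coordinator_synthesize(log: IntelLog, round_no: int) -> None:
    """Summarize this round's team findings. There is no API for issuing orders."""
    team_notes = [
        e for e in log.entries
        if e.round_no == round_no and e.source != "coordinator"
    ]
    if team_notes:
        summary = "; ".join(f"{e.source}: {e.note}" for e in team_notes)
        log.publish(IntelEntry(round_no, "coordinator", summary))


def strategist_context(log: IntelLog, round_no: int) -> list[str]:
    """Intel a strategist sees before picking attacks for round_no."""
    return [f"[{e.source}] {e.note}" for e in log.read_until(round_no - 1)]
```

Because every strategist reads the same log and nothing else, influence between teams flows only through published findings, which is the property the design depends on.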
How Findings Get Classified
One of the hardest problems in swarm testing is attribution. If nine agents are all probing the same system and one finds something, was it really a swarm discovery? Or would a single agent have found it anyway?
I built a three-tier provenance system to answer this honestly:
Single-team — one team found it on its own. The other teams contributed nothing material. This is the most common classification and the honest one for most security findings.
Cross-team assisted — one team's published intelligence helped another team find something. In the canonical run, this looked like Alpha mapping an endpoint's timing behavior and publishing it to the intel log, then Gamma's strategist reading that timing data and using it to calibrate a pressure attack in a later round. Both teams contributed, but a single team would probably still have reached the finding on its own, just later.
Swarm-emergent — the finding required contributions from multiple teams. This is the rarest and most interesting classification, because it's the only one that justifies the complexity of a swarm. The test is strict: a finding is swarm-emergent only if removing any one contributing team makes it materially less plausible. This is the counterfactual success test, and it's designed specifically to prevent overclaiming.
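The three tiers and the counterfactual test can be expressed as a small decision rule. The sketch below is an assumption about how such a classifier might look, not the project's code. In particular, the `plausible_without` input stands in for a reviewer's judgment about whether the finding remains plausible with a given team removed, which is not something code can decide by itself.

```python
from __future__ import annotations

from enum import Enum


class Provenance(Enum):
    SINGLE_TEAM = "single-team"
    CROSS_TEAM_ASSISTED = "cross-team assisted"
    SWARM_EMERGENT = "swarm-emergent"


def classify(contributors: set[str], plausible_without: dict[str, bool]) -> Provenance:
    """Classify a finding by provenance.

    contributors: teams that materially contributed to the finding.
    plausible_without[team]: True if the finding stays plausible with that
    team removed. Missing entries default to True, the conservative choice,
    so an unreviewed finding can never be labeled swarm-emergent.
    """
    if len(contributors) <= 1:
        return Provenance.SINGLE_TEAM
    # Counterfactual test: every contributing team must be necessary.
    every_team_necessary = all(
        not plausible_without.get(team, True) for team in contributors
    )
    if every_team_necessary:
        return Provenance.SWARM_EMERGENT
    return Provenance.CROSS_TEAM_ASSISTED
```

Defaulting unknown judgments to "still plausible" biases the classifier toward the weaker label, which matches the stated goal of preventing overclaiming.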
The Budget Model
Nine agents with unrestricted budgets would be meaningless — you'd learn nothing about real-world constraints. So the swarm operates under a three-layer budget model:
Per-agent budget — each of the nine agents has an individual bond budget. They can't spend more than their allocation regardless of what their strategist wants.
Per-team budget — each team has a team-level cap equal to the sum of its agents' budgets. Combined with the per-agent limits, this keeps a team from concentrating all of its spend on one agent.
Campaign cap — an overall budget caps total spend across all teams. This is the hard ceiling that forces strategic trade-offs between teams. If Alpha burns through budget on reconnaissance, Gamma has less to work with for pressure testing.
The BudgetTracker enforces all three layers at runtime. Not in prompts. Not as guidelines. As hard stops in code. Fifteen unit tests cover every edge case.
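A three-layer check of this kind might look like the following Python sketch. Only the `BudgetTracker` name comes from the project; the constructor arguments, the `spend` method, and the `BudgetExceeded` exception are assumptions made for illustration.

```python
from __future__ import annotations


class BudgetExceeded(Exception):
    """Raised when a spend would breach any budget layer."""


class BudgetTracker:
    def __init__(
        self,
        agent_caps: dict[str, int],
        team_of: dict[str, str],
        team_caps: dict[str, int],
        campaign_cap: int,
    ) -> None:
        self.agent_caps = agent_caps
        self.team_of = team_of
        self.team_caps = team_caps
        self.campaign_cap = campaign_cap
        self.agent_spent = {agent: 0 for agent in agent_caps}
        self.team_spent = {team: 0 for team in team_caps}
        self.campaign_spent = 0

    def spend(self, agent_id: str, cents: int) -> None:
        """Check all three layers first, then commit, so a rejected spend changes nothing."""
        if cents <= 0:
            raise ValueError("spend must be a positive number of cents")
        team = self.team_of[agent_id]
        if self.agent_spent[agent_id] + cents > self.agent_caps[agent_id]:
            raise BudgetExceeded(f"agent cap exceeded for {agent_id}")
        if self.team_spent[team] + cents > self.team_caps[team]:
            raise BudgetExceeded(f"team cap exceeded for {team}")
        if self.campaign_spent + cents > self.campaign_cap:
            raise BudgetExceeded("campaign cap exceeded")
        self.agent_spent[agent_id] += cents
        self.team_spent[team] += cents
        self.campaign_spent += cents
```

When the campaign cap is lower than the sum of the team caps, one team's spending reduces what is left for the others, which is where the trade-off between teams comes from.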
What Actually Happened
The canonical run: 5 rounds, 9 agents, 3 teams, interleaved execution (Alpha → Beta → Gamma each round).
97 offensive attacks total. Alpha launched 37, Beta launched 22 after its trust-building phase, and Gamma launched 38. Separately, Beta also completed 48 clean bond cycles during trust-building before switching to offense.
77 caught. 20 uncaught. 0 swarm-emergent findings.
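As a quick sanity check on those totals, the per-team counts and the caught and uncaught figures reconcile. The variable names below are illustrative; the numbers are the ones reported for the canonical run.

```python
attacks_by_team = {"alpha": 37, "beta": 22, "gamma": 38}
caught, uncaught = 77, 20

total = sum(attacks_by_team.values())
assert total == caught + uncaught == 97

catch_rate = round(100 * caught / total)
uncaught_rate = round(100 * uncaught / total)
print(total, catch_rate, uncaught_rate)  # prints "97 79 21"
```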
Let me break down what that means.
The 20 Uncaught
First, the number that a skeptical reader will focus on: 20 attacks — 21% — were classified as uncaught. That sounds bad until you look at what they actually were. The same pattern from the first article repeated here: the uncaught results fell into three categories.
Previously documented design decisions — unauthenticated GET endpoints that are intentionally public (health checks, dashboard reads). These aren't vulnerabilities; they're features. The swarm "discovered" things I already knew and had documented.
Self-resolution blocking misinterpretation — the attacker can't resolve its own actions (by design), so governance triggers like auto-ban appear broken from the attacker's perspective. AgentGate is working correctly; the attacker just can't see it because it's on the wrong side of the dual-control boundary.
Capacity math edge cases — the attacker misreading how the exposure × 1.2 ≤ bond capacity formula works and classifying correct enforcement as a gap.
None of the 20 uncaught represented new findings. The swarm confirmed what single-agent testing had already surfaced.
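The capacity formula is easy to misread, so a worked example helps. The sketch below assumes amounts are whole cents and that the check is exactly exposure × 1.2 ≤ bond capacity. The function names are hypothetical, and AgentGate's real rounding behavior may differ.

```python
MARGIN_NUM, MARGIN_DEN = 12, 10  # the 1.2 multiplier as an exact integer ratio


def within_capacity(exposure_cents: int, capacity_cents: int) -> bool:
    """True when exposure * 1.2 <= capacity, using integer math to avoid float error."""
    return exposure_cents * MARGIN_NUM <= capacity_cents * MARGIN_DEN


def max_exposure(capacity_cents: int) -> int:
    """Largest whole-cent exposure that still passes the capacity check."""
    return capacity_cents * MARGIN_DEN // MARGIN_NUM
```

With a capacity of 100 cents, `max_exposure(100)` is 83: an exposure of 83 passes (83 × 1.2 = 99.6) and 84 fails (84 × 1.2 = 100.8). An attacker who expects a full 100 cents of exposure to fit would see a rejection, and one way to misclassify correct enforcement is to log that rejection as a gap.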
Gamma: 100% Catch Rate
Team Gamma — the coordinated pressure team — ran 38 attacks against AgentGate's economic enforcement layer. Every single one was caught. 38 for 38. Synchronized high-value bonds from three agents, rapid-fire execution bursts, attempts to exhaust shared budget from multiple vectors — the economic enforcement held across all of it.
This is the result I'm most confident in, because Gamma's attacks are the least ambiguous. You either break the capacity math from multiple angles or you don't. They didn't.
Beta: The Math Doesn't Secretly Degrade
Beta's trust-building phase completed perfectly — 48 clean bond cycles across three agents. Flawless track record. Then they flipped offensive and launched 22 attacks.
The result: the exploit Beta was designed to find didn't exist. AgentGate doesn't gate bond capacity on reputation — any identity can lock any bond amount up to the capacity limit, regardless of track record. A brand-new identity has the same privileges as one with 48 clean cycles. But the failure to find the expected vulnerability revealed something more interesting: the enforcement math is genuinely stateless. It doesn't soften for agents that have been playing nice.
This matters because it's the kind of implicit assumption that rarely gets tested explicitly. Most systems don't have a "trusted agent" privilege mode, but do they have subtle behavioral preferences baked into their enforcement logic? Beta's run showed that AgentGate doesn't. And if AgentGate ever adds reputation-gated bond capacity in the future, Beta's playbook is exactly how an attacker would exploit it — so we've pre-red-teamed a feature that doesn't exist yet.
Alpha: Reconnaissance With Cross-Team Value
Alpha's 37 attacks mapped defense surfaces and timing behavior. Two findings were classified as cross-team assisted — Alpha's intelligence demonstrably influenced another team's strategy selection. The coordinator's synthesis appeared in strategist reasoning in subsequent rounds.
But none of Alpha's findings produced a swarm-emergent result. Everything Alpha found, a single team could have found on its own.
The Intel Log
73 entries across 5 rounds. The coordinator synthesized cross-team patterns every round, and the synthesis demonstrably influenced team strategy — I can trace specific strategist picks back to coordinator observations in the log. The coordination mechanism worked. It just didn't find anything that coordination was required to discover.
The Honest Conclusion
The swarm found nothing that a single team couldn't have found alone.
I could spin this as a failure. Nine agents, all that architecture, and the big finding is… the system held? But that's exactly the wrong way to think about it.
The entire point of red teaming is to find failures. If you don't find them, one of two things is true: the system is robust, or your tests aren't creative enough. The way you tell the difference is by throwing your most creative, most escalated attack model at it and seeing what happens.
That's what Stage 5 is. Three independent teams with distinct objectives, indirect coordination through shared intelligence, a trust exploitation team that plays the long game, an economic pressure team that tests the capacity math from multiple angles, a reconnaissance team that feeds both — and a strict provenance system that prevents overclaiming.
The counterfactual success test is the key. Without it, I could have classified some of those cross-team findings as swarm-emergent and written a sexier article. The provenance system exists specifically to keep me honest.
Under the constraints of this test design (5 rounds, 9 agents, registry-only attacks, informational coordination), the swarm demonstrated four things. Per-identity enforcement showed no coordination-exploitable gaps across nine separate identities with distinct behavioral patterns. Economic enforcement held against coordinated pressure from three agents testing the capacity math from multiple angles simultaneously. Behavioral history earned no implicit privilege: Beta's 48 clean cycles produced zero advantage over a fresh identity. And indirect coordination through the intel log worked mechanically as designed but surfaced no novel attack vectors that single-team testing hadn't already found. The burden for further escalation is high: more agents, more teams, and longer campaigns are unlikely to produce different results unless AgentGate adds reputation-gated features that give the attacker new surface area. This does not prove the model is immune to all possible swarm topologies. It means this topology, the most escalated I've built, found nothing new.
Importantly, the swarm didn't erase the strategic limitations I flagged in the first article. Cheap identity creation, no registration rate limits, unauthenticated GET endpoints — those are still there. What the swarm clarified is which of those are genuine future attack surfaces versus which ones fail to create leverage in the current design.
The Audit Trail
I don't ship things I haven't tried to break, and I don't break things with only one tool.
The v0.5.0 codebase went through an 8-round Claude Code security audit that produced 14 fixes (round 8 came back clean), followed by a cold-eyes Codex audit that found 4 more issues (3 medium, 1 low — all fixed), followed by Claude Code cross-verification of the Codex fixes.
330 tests across 34 files. The test suite is bigger than most of the projects it tests.
One audit finding worth calling out: the original strategist parsers didn't validate that agent IDs returned by Claude belonged to the requesting team. A hallucinated agent ID could have caused Team Alpha to execute attacks under Team Beta's identity — a cross-team identity hijack that would have corrupted the provenance data. Fixed with per-team validation and an execution guard in the campaign runner. That's a coordination-layer bug that single-team testing would never find.
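The per-team validation can be sketched as a roster check on whatever the strategist returns. This is an illustrative Python version, not the actual fix; the `validate_picks` name and the shape of a pick (a dict with an `agent_id` key) are assumptions.

```python
from __future__ import annotations


def validate_picks(
    team: str, roster: dict[str, set[str]], picks: list[dict]
) -> list[dict]:
    """Reject strategist output that names any agent outside the requesting team."""
    allowed = roster[team]
    foreign = [p.get("agent_id") for p in picks if p.get("agent_id") not in allowed]
    if foreign:
        raise ValueError(f"agent IDs not on team {team}: {foreign}")
    return picks
```

Failing the whole batch, instead of silently dropping the bad pick, keeps a hallucinated ID from quietly skewing the provenance data. An execution guard in the campaign runner would repeat the same membership check immediately before each attack runs.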
Another swarm-specific discovery: the audit identified a prompt injection chain where attack result details flowed into the coordinator's Claude API prompt, which produced synthesis entries in the shared intel log, which fed into all three team strategists. A malicious AgentGate could have poisoned the swarm's own strategy selection through that chain. The fix was quoteIntelForPrompt() — strict sanitization applied to all untrusted data entering Claude prompts across the coordinator, reporter, and intel log. The blast radius without the fix would have been suboptimal attack selection, not destructive actions, but it's the kind of recursive meta-vulnerability — the target system hacking the tool testing it — that only surfaces when you build the coordination layer and try to break it.
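For readers who want a feel for what sanitizing untrusted data before it enters a prompt involves, here is a hedged Python analogue. It is not the project's quoteIntelForPrompt() implementation. The function name, the `<untrusted_data>` delimiter, and the 500-character limit are all assumptions chosen for illustration.

```python
import re

MAX_LEN = 500  # hypothetical cap on untrusted text per entry


def quote_intel_for_prompt(text: str, max_len: int = MAX_LEN) -> str:
    """Wrap untrusted text so it reads as quoted data, not as instructions."""
    # Replace control characters (including newlines) so the data cannot
    # start new lines or sections inside the prompt.
    cleaned = re.sub(r"[\x00-\x1f\x7f]", " ", text)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Truncate before escaping so the limit applies to the raw content.
    cleaned = cleaned[:max_len]
    # Escape angle brackets so the data cannot close the delimiter early.
    cleaned = cleaned.replace("<", "&lt;").replace(">", "&gt;")
    return f"<untrusted_data>{cleaned}</untrusted_data>"
```

Quoting like this narrows the injection path but does not eliminate it, so the prompt itself still has to tell the model to treat delimited content as data only.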
Why This Matters Beyond My Project
The pattern here isn't specific to AgentGate. It's applicable to any system where AI agents operate under governance constraints.
Swarm testing is worth considering once single-agent and coordinated-team tests stop producing new classes of findings. In my case, the swarm didn't discover anything a single team couldn't have found. That negative result was still useful: it increased confidence that the enforcement model doesn't leak cross-identity state and that the economic math doesn't have coordination-exploitable timing windows.
If you're evaluating AI governance claims, ask about the provenance model. Anyone can throw a hundred agents at a system and claim it held. The question is how they classified the findings. Did they have a counterfactual test? Can they distinguish swarm-emergent from single-team? If not, they're either overclaiming robustness or undercounting vulnerabilities.
If you're building multi-agent systems, the coordinator design matters. I chose informational coordination over directive coordination because it let teams influence each other without collapsing the swarm into a single orchestrated script. Whether that matches your threat model is a judgment call — but your testing should reflect whatever coordination model you're actually worried about.
The Numbers
| Metric | Value |
| --- | --- |
| Total agents | 9 (3 teams × 3) |
| Campaign rounds | 5 |
| Total attacks | 97 |
| Caught | 77 (79%) |
| Uncaught | 20 (21%), all previously documented |
| Swarm-emergent findings | 0 |
| Cross-team assisted findings | 2 |
| Gamma catch rate | 100% (38/38) |
| Beta clean bond cycles | 48 (separate non-attack operations) |
| Beta offensive attacks | 22 |
| Intel log entries | 73 |
| Runtime | ~5 minutes |
| Tests | 330 across 34 files |
| Audit fixes | 14 (Claude Code) + 4 (Codex) = 18 total |
What's Next
Agent 004's red team arc is complete. Five stages — static, adaptive, recursive, coordinated team, coordinated swarms — and the governance layer survived all of them. Further escalation would likely produce diminishing returns. The only thing that would change the calculus is AgentGate adding reputation-gated features that give the attacker new surface area to exploit.
The sandbox architecture that powered Agent 004's attack generation has already been repurposed. Agent 005 takes the same four-layer sandbox and turns it into a recursive verification engine. Instead of generating attacks, it generates proofs. Same sandbox. Opposite intent.
The code is open source: Agent 004 · AgentGate · Agent 005
I'm not a coder. I've never written a line of code in my life. I direct AI coding agents to build these systems. The architecture, the sequencing, the adversarial thinking — that's me. The implementation is Claude Code, audited by Codex and cross-verified by both. If you're interested in how a non-coder builds AI agent infrastructure, the repos tell the story.