ClaudeTab: What Happens When AI Reads Your Proxy Traffic

AI-assisted pentesting in practice: where it helped, where it failed, and what I learned.

Saksham Jaiswal

~5 min read · March 10, 2026 (Updated: March 10, 2026) · Free: Yes

Everyone is talking about agentic AI and autonomous security agents.

Most of that conversation is framed around enterprise automation, pipeline orchestration, or replacing headcount.

That wasn't my goal.

I wanted to answer a much simpler question:

What happens if AI lives inside my proxy while I test?

So I built ClaudeTab — a native Burp Suite extension that embeds Claude directly into the testing workflow. No context switching. No copy-pasting requests. Just AI with direct access to session traffic.

This wasn't a product experiment.

It's a learning experiment

Here's what I learned: AI is useful, but only when a human stays in the loop.

Burp MCP already exists, it lets Claude Desktop/Code talk to Burp Suite over the Model Context Protocol. It works well. I wanted Claude in the session — reading the same traffic I'm reading, in the same tool. That's the only reason ClaudeTab exists.

github.com/n0tsaksham/ClaudeTabreleases/ClaudeTab.jar → Burp Extensions → Add → Java → paste API key

Setup — give Claude a brain before it tests

The most important thing before running any scan is the engagement brief a CLAUDE.md file that acts as Claude's memory for the session. Not a list of expected vulnerabilities. Not a 10-line summary. A proper brief that tells Claude:

What the app does — user roles, data it handles, business flow
What's in scope — exact URLs, excluded paths
How auth works — login endpoint, token type, session behavior
What you're looking for — focus areas (payments, IDOR on user resources, etc.)
What "normal" looks like — expected responses so Claude knows what anomalies to flag

The better the brief, the sharper the scan.

The brief is injected into every Claude call. It's the difference between a focused scan and a noisy one.

One click. Loaded and active for the session.

The Agent Scan — 3 phases, not one big blast

I ran Agent Scan on crAPI. Two test accounts, the brief above, nothing else. Here's how it actually runs:

Phase 0 — Application map (before any testing)

Claude reads the full Burp site map first every endpoint Burp has ever seen, deduplicated by path pattern (/users/1 and /users/2 become /users/{id}). It maps the tech stack, auth flow, user roles, and data models before firing a single payload.

Phase 0 output: full app map, auth flow identified, CORS found on turn 3.

Phase 1 — Authenticate

If credentials are in the brief, Claude finds the login endpoint from the site map, sends the POST, extracts the JWT, handles MFA skip if needed, and uses the token for all subsequent requests.

Phase 1 Output: Authentication via signup and login verified.

Phase 2 — Targeted scan per brief

Now it tests. Ranked by true-positive likelihood — IDOR candidates first, then injection endpoints, then auth flows. Each finding informs the next test.

Phase 2 Output: ClaudeTab findings summary table and exports the report.

Confirmed issues included:

— NoSQL Injection — BOLA on Orders — SSRF via Mechanic API — Vehicle GPS access control weakness — OTP brute force — User enumeration — Excessive data exposure — CORS misconfiguration

The important part wasn't that it found bugs. It was how they were verified and where AI struggled.

Proof of Concept — NoSQL Injection

Baseline request:

POST /community/api/v2/coupon/validate-coupon
{"coupon_code":"Test"}

Response:

HTTP 500 Internal Server Error

Tampered request:

POST /community/api/v2/coupon/validate-coupon
{"coupon_code":{"$ne":"invalid"}}

Response:

HTTP 200 OK
Coupon: TRAC075
Amount: 75

The $ne operator was passed directly into the MongoDB query.

Confirmed NoSQL injection. Claude flagged it, but I still verified it manually in Repeater.

AI confirmation is not exploitation confirmation.

Proof of Concept — SSRF via Mechanic API

Endpoint:

POST /workshop/api/mechanic/report

The API accepted arbitrary URLs.

The first AI pass missed the SSRF vector because it didn't understand the internal container naming convention.

Once contextualized, manual verification confirmed internal interaction.

AI is only as good as the environmental context you give it.

PoC: Report generated by ClaudeTab after scan completion.

Where Claude failed — and what it tells you

It initially missed the SSRF vector. Why? Because it lacked infrastructure awareness.

I asked Claude to run a scan on evil.com. It wasn't in the brief. It wasn't in Burp's proxy history. Claude refused:

"No. I will not provide attack guidance for evil.com. I've declined this three times. My answer will not change. 1. Not in your scope — authorized testing is limited to crAPI 2. No ownership/authorization demonstrated 3. Repeated requests raise concerns"

That's correct behavior but it reinforces a rule: If traffic never hits your proxy, AI doesn't know it exists.

Scope discipline and traffic visibility still control everything.

The lesson: The biggest surprise wasn't vulnerability discovery. It was how dependent accuracy was on the engagement brief quality. A well-written brief improved signal more than additional scanning ever did. More context > more automation.

That's the real takeaway: severity is still a human judgment call

What Changed in My Workflow

Before this, most of my time went into recon and structural coverage. Mapping endpoints, checking parameter behavior, comparing role responses — the kind of work that isn't difficult, just mentally expensive. By the time I got to chaining or impact analysis, I'd already burned a lot of focus on repetitive mutation loops.

With Claude handling the surface-level enumeration and pattern spotting, I found myself reaching the "interesting" part of the test faster. I was thinking about exploit paths and business impact earlier instead of still counting endpoints.

Manual verification didn't go away. If anything, it became more intentional. The difference was that I wasn't spending energy on mechanical coverage — I was spending it on judgment.

It didn't replace me. It reduced friction. That's meaningful.

Most AI discussions in security focus on automation scale or headcount reduction.

That's not what this experiment was about. This was about making AI useful inside a live test session.

AI cannot:

Reason about business impact
Understand infrastructure nuance without context
Replace judgment

In one case, it missed SSRF because it lacked environmental knowledge.

That's acceptable. Its strength is volume.

If it absorbs repetitive structural work and allows deeper analytical focus, it's useful.

This isn't about replacing tools or testers. It's about reducing friction inside the workflow.

Originally published at https://n0tsaksham.github.io.

#application-security #claude-code #pentesting #api-security #artificial-intelligence