There's a dirty secret in application security that nobody likes to talk about.

Your SAST tool just flagged 91 vulnerabilities. Your team opens the report, sees the list, and collectively sighs. Because everyone knows what comes next — a week of manually clicking through every finding, spinning up the app locally, crafting payloads by hand, and trying to figure out which of these 91 "critical" issues will actually result in data walking out the door.

Most teams don't finish. They triage, they guess, they ship anyway.

I got tired of that cycle. So I built Dynamic Security Tester — an AI-powered DAST tool that takes your static analysis output and asks a simple question: does this vulnerability actually work?

The Gap Nobody Wants to Talk About

Static analysis tools (Semgrep, CodeQL, Trivy, Gitleaks — pick your flavor) are incredible at finding potential vulnerabilities. They read your code, pattern-match against known bad practices, and produce a list. But that list has a problem: false positive rates routinely hit 40–60% in real codebases.

You can't just fix everything on the list. You have limited engineering time, limited sprint capacity, and a product to ship. You need to know: which ones are real?

That's the gap. Between "the scanner said so" and "we confirmed this is exploitable in production" — there's a massive manual effort that most security teams either don't have the bandwidth for, or outsource to expensive penetration testers.

Dynamic Security Tester fills that gap automatically.

What It Actually Does

The tool takes static analysis output from any of 7 supported analyzers and runs each finding through an LLM-powered exploitation agent that uses real browser automation to test it against your live application.

Here's the pipeline in plain English:

1. Parse — Drop in your Semgrep JSON, Trivy output, CodeQL SARIF, Gitleaks findings, whatever you have. The tool auto-detects the format and normalizes everything into a unified structure. Multiple files? No problem — SHA-256 content hashing deduplicates findings across analyzers, so you don't test the same SQL injection three times just because it showed up in both Semgrep and CodeQL.

2. Prioritize — Findings get sorted into 16 vulnerability category buckets (injection, XSS, SSRF, XXE, path traversal, auth, secrets, and more) and scored by severity, confidence, and exploitability potential. The most dangerous things get tested first.

3. Exploit — An LLM agent (your choice of provider — more on that shortly) takes each finding and gets to work. It has access to 21 tools: 15 Playwright browser automation tools for real browser interaction, plus 6 exploitation workflow tools for payload generation, WAF bypass, and response analysis.

4. Classify — Every finding gets one of four verdicts: CONFIRMED, LIKELY, BLOCKED, or NOT_REPRODUCIBLE. And it doesn't just slap a label on it — it records proof, the exact payload used, the HTTP response, and how many requests were made (a built-in audit trail so the agent can't cheat by skipping the test).

5. Report — SARIF 2.1.0 for IDE integration, an HTML report for stakeholders, and JSON for CI pipelines.

The Part That Makes It Different: Proof Levels

Here's something I've never seen another tool do cleanly. Every confirmed finding gets a proof level from L0 to L4:

  • L0: No exploitation (security controls working)
  • L1: Injection point confirmed (error messages, timing differences)
  • L2: Query structure manipulated (boolean logic, UNION SELECT working)
  • L3: Data extraction proven (actual data retrieved, DB version extracted)
  • L4: Critical impact demonstrated (admin creds, command execution)

L3 and L4 = CONFIRMED. L1–L2 with an external blocker = BLOCKED. L1–L2 with no blocker = LIKELY. Nothing happened = NOT_REPRODUCIBLE.
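
Those rules reduce to a small decision function. A minimal sketch; the proof levels and verdict names come from the rules above, but the function shape is mine, not the tool's source:

```javascript
// Verdict mapping as described in the article's proof-level rules.
function classify(proofLevel, externallyBlocked) {
  if (proofLevel >= 3) return "CONFIRMED";           // L3–L4: impact proven
  if (proofLevel >= 1) {
    return externallyBlocked ? "BLOCKED" : "LIKELY"; // L1–L2: depends on blocker
  }
  return "NOT_REPRODUCIBLE";                         // L0: nothing happened
}
```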

This matters enormously when you're talking to engineers. "We found a SQL injection" is a conversation. "We found a SQL injection and extracted your users table, here's the payload and the response" is a P0 incident.

Bring Your Own LLM

One design decision I'm genuinely proud of: the tool supports 7 LLM providers out of the box, and all of them speak the same interface. Swap providers with a single flag.

node src/main.js --provider=google --model=gemini-2.5-pro
node src/main.js --provider=openrouter --model=meta-llama/llama-3.3-70b-instruct:free
node src/main.js --provider=nvidia --model=moonshotai/kimi-k2-instruct

OpenAI, DeepSeek, Qwen, GitHub Copilot, Google Gemini, OpenRouter (200+ models, free tier available), and NVIDIA NIM. If you're budget-conscious, OpenRouter's free tier and NVIDIA NIM's free models make this genuinely zero-cost to run beyond the compute for your app.

The agent loop itself is provider-agnostic. The same exploitation logic, the same browser tools, the same classification system — regardless of which model is driving.
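
Supporting many providers behind one OpenAI-compatible interface is mostly a matter of swapping base URLs and API keys. A hedged sketch of that idea; the provider table, base URLs, and env-var names here are illustrative assumptions, not the tool's actual configuration:

```javascript
// Sketch: one request builder, many OpenAI-compatible providers.
// Entries shown are examples; env-var names are assumptions.
const PROVIDERS = {
  openai:     { baseUrl: "https://api.openai.com/v1",           keyEnv: "OPENAI_API_KEY" },
  openrouter: { baseUrl: "https://openrouter.ai/api/v1",        keyEnv: "OPENROUTER_API_KEY" },
  nvidia:     { baseUrl: "https://integrate.api.nvidia.com/v1", keyEnv: "NVIDIA_API_KEY" },
};

function buildRequest(provider, model, messages) {
  const cfg = PROVIDERS[provider];
  if (!cfg) throw new Error(`Unknown provider: ${provider}`);
  return {
    url: `${cfg.baseUrl}/chat/completions`,
    headers: {
      Authorization: `Bearer ${process.env[cfg.keyEnv] ?? ""}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model, messages }),
  };
}
```

Because every provider speaks the same chat-completions shape, the agent loop never needs to know which one is behind the flag.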

Real Browser Automation, Not Just HTTP Requests

This is where it gets interesting. The agent doesn't just fire HTTP requests at your API. It has a full Playwright browser available.

That means it can:

  • Fill and submit login forms to capture JWT tokens, then propagate those tokens across subsequent tests
  • Execute arbitrary JavaScript in the page context to detect DOM-based XSS
  • Wait for SPA content to load before interacting with it
  • Force-click elements that are hidden behind overlays or disabled states
  • Take screenshots as exploitation evidence
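
The first capability, capturing a JWT at login and carrying it forward, can be sketched with a tiny session store. Everything here (response field names, header shape) is an illustrative assumption, not the tool's internals:

```javascript
// Sketch: capture a token from a login response, inject it into later requests.
const session = { token: null };

function captureToken(loginResponseBody) {
  // Many apps return { token: "..." } or { accessToken: "..." } on login
  const body = JSON.parse(loginResponseBody);
  session.token = body.token || body.accessToken || null;
  return session.token;
}

function authHeaders() {
  // Subsequent test requests pick up the captured credential automatically
  return session.token ? { Authorization: `Bearer ${session.token}` } : {};
}
```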

And yes, it does make direct HTTP requests too — with response timing included. That timing data is how it detects time-based blind SQL injection (the kind that doesn't return errors, just makes your DB think for 5 seconds).

browser_http_request -> POST /api/login
  payload: {"email": "' OR SLEEP(5)--", "password": "x"}
  responseTimeMs: 5243  ← That's your blind SQLi confirmation
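
The decision rule behind that confirmation is simple: compare probed latency against a benign baseline. A sketch, assuming a JSON endpoint and a 5-second SLEEP payload; the jitter tolerance is an illustrative choice, not the tool's threshold:

```javascript
// Sketch: timing-based blind SQLi detection.
async function timedRequest(url, body) {
  const start = Date.now();
  await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  }).catch(() => {}); // even a failed request yields a timing sample
  return Date.now() - start;
}

// Pure decision rule: is the probed latency consistent with SLEEP(n)?
function looksLikeTimeBasedSqli(baselineMs, probedMs, sleepSeconds = 5) {
  return probedMs - baselineMs >= sleepSeconds * 1000 * 0.8; // tolerate jitter
}
```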

The WAF Bypass Engine

Real applications have WAFs. Real exploitation requires getting past them.

The bypass engine generates deterministic variations for blocked payloads — encoding bypasses, technique bypasses, and WAF-specific bypasses for common products. When the agent gets a BLOCKED response, it doesn't give up. It generates a suite of bypass variants and tries again.
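
A few representative encoding tricks show the flavor of deterministic variant generation. These three are common examples for illustration, not the engine's actual list:

```javascript
// Sketch: deterministic bypass variants for a blocked payload.
function bypassVariants(payload) {
  return [
    // URL-encode every character to evade literal string matching
    [...payload].map(c => "%" + c.charCodeAt(0).toString(16).padStart(2, "0")).join(""),
    // Alternate case to evade naive case-sensitive keyword filters
    [...payload].map((c, i) => (i % 2 ? c.toUpperCase() : c.toLowerCase())).join(""),
    // Replace whitespace with inline SQL comments to split keywords
    payload.replace(/\s+/g, "/**/"),
  ];
}
```

Because the variants are deterministic, a BLOCKED finding retried later produces the same sequence of attempts, which keeps runs reproducible.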

The response analyzer runs in parallel, detecting:

  • DB error signatures across 8 databases (MySQL, PostgreSQL, MSSQL, Oracle, SQLite, MongoDB, CouchDB, Cassandra)
  • 11 WAF signatures
  • Boolean and timing analysis for blind injection
  • SSRF, XXE, and XSS indicators
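
Signature matching of the first kind is essentially a table of regexes. A sketch with one representative, well-known error pattern per database; the real analyzer covers 8 databases with far more patterns than shown here:

```javascript
// Sketch: fingerprint a database from error strings in the response body.
const DB_SIGNATURES = [
  { db: "MySQL",      re: /You have an error in your SQL syntax/i },
  { db: "PostgreSQL", re: /ERROR:\s+syntax error at or near/i },
  { db: "MSSQL",      re: /Unclosed quotation mark after the character string/i },
  { db: "SQLite",     re: /SQLITE_ERROR|unrecognized token/i },
];

function detectDbError(responseBody) {
  const hit = DB_SIGNATURES.find(({ re }) => re.test(responseBody));
  return hit ? hit.db : null; // null means no DB error leaked
}
```

Knowing the backend database also feeds the payload generator: a MySQL SLEEP() probe is useless against PostgreSQL, which wants pg_sleep().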

Dropping This Into CI/CD

Security that lives outside the deployment pipeline is security that gets ignored. Dynamic Security Tester has a first-class CI mode:

node src/main.js \
  --ci \
  --target=http://localhost:3000 \
  --results=semgrep.json,trivy.json \
  --output=./security-output \
  --fail-on-likely

Exit codes are clean:

  • 0 — pass, nothing actionable
  • 1 — fail, confirmed exploits found
  • 2 — error, scan failed

Add --fail-on-likely if you want to block builds on probable issues too. Add --fail-on-blocked for maximum paranoia. The machine-readable ci-report.json gives your pipeline everything it needs to make the call.

{
  "summary": {
    "total": 91,
    "confirmed": 3,
    "likely": 2,
    "blocked": 1,
    "notReproducible": 85
  },
  "exitCode": 1,
  "exitReason": "FAIL: 3 CONFIRMED exploit(s) found"
}
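
A pipeline consuming that report can derive the pass/fail decision with a few lines of logic. This is a minimal sketch of the mapping described above; the option names mirror the CLI flags, but the function itself is illustrative, not the tool's source:

```javascript
// Sketch: map a ci-report.json summary to a pass/fail exit code.
function exitCodeFor(summary, { failOnLikely = false, failOnBlocked = false } = {}) {
  if (summary.confirmed > 0) return 1;               // confirmed exploits always fail
  if (failOnLikely && summary.likely > 0) return 1;  // opt-in strictness
  if (failOnBlocked && summary.blocked > 0) return 1; // maximum paranoia
  return 0;
}
```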

91 findings in. 3 actual problems to fix. That's the signal-to-noise ratio shift this tool is designed to create.

Getting Started in 5 Minutes

git clone https://github.com/anishalx/dynamictester.git
cd dynamictester
npm install
npx playwright install chromium
# Set your API key (OpenAI shown, but any provider works)
export OPENAI_API_KEY="sk-your-key"
# Run it
node src/main.js

The interactive CLI walks you through the rest — point it at your analyzer output, give it your target URL, pick a vulnerability category, and watch it work.

What This Isn't

I want to be clear about scope. This is not a replacement for a skilled penetration tester. It won't find vulnerabilities your static analyzers missed. It won't do business logic testing. It doesn't know your application's domain the way a human researcher does.

What it is: a force multiplier. It automates the grunt work of validating static analysis findings, giving your security team (or your solo dev wearing the security hat) the ability to triage 90 findings in the time it used to take to manually test 10.

The 3 confirmed exploits in a sea of 91 findings? Those are the ones that matter. This tool finds them for you.

The Stack

For the curious: Node.js 18+, Playwright for browser automation, a provider-agnostic OpenAI-compatible client interface for LLM calls, Vitest for 133 unit tests covering the bypass engine, payload generator, and response analyzer. The whole thing is structured for extensibility — adding a new parser, provider, or browser tool is a documented, straightforward process.

Try It, Break It, Contribute

The project is open source on GitHub at github.com/anishalx/dynamictester.

If you work with a static analyzer that isn't supported yet, the parser interface is designed to make adding one straightforward. Same for LLM providers. Pull requests are welcome — especially for new prompt templates, additional browser tools, and test coverage.

Security tooling that's actually useful is rare. I built this because I needed it. Hopefully you do too.

Built by Anish — final-year CS student, cybersecurity enthusiast, and person who got very tired of false positives.

Questions? Drop them in the comments or find me on LinkedIn.