pentest-ai: The Open Source Pentest Tool That Proves Every Finding Before It Reports It

Security scanning tools have a trust problem. Not the tools themselves, but the relationship between security teams and the output those tools produce. A typical ZAP or Nuclei run against a modern single-page application produces hundreds of findings. Triaging those findings takes hours. A significant proportion turn out to be false positives. The team that started the day hoping to understand their application's security posture ends the day having confirmed that their scanner is noisy. After enough cycles of this, the findings stop being read carefully. The tool becomes background noise.

This is not a hypothetical. In a published comparison against OWASP Juice Shop, a heavily-studied vulnerable application, ZAP 2.17.0 produced 593 findings with a forty-seven percent false-positive rate. Zero of those findings were rated Critical or High. Nuclei 3.8.0 produced one finding. pentest-ai 0.13.0 produced eighty-eight findings, forty-six of them Critical or High, with a zero percent false-positive rate.

The difference is not cleverness in the probes. The difference is the oracle gate. pentest-ai does not report a finding until a machine oracle has re-run the exploit and confirmed it. A candidate finding stays a candidate until it earns a VERIFIED badge through independent reproduction. What reaches the report is what was proven, not what was suspected.

The Trust Spine: Why Oracle Verification Changes Everything

The central architectural principle in pentest-ai is the Trust Spine, the enforcement layer that ensures every verified finding comes from a named machine oracle rather than from an LLM assertion.

The distinction matters because LLMs are excellent at reasoning about security findings but cannot be trusted as the primary source of ground truth about whether a vulnerability actually exists. An LLM that has read extensive security literature can plausibly assert that a given HTTP response pattern looks like it might be vulnerable to SQL injection. That assertion is useful input for directing probe execution. It is not sufficient evidence that the application is actually vulnerable.

pentest-ai enforces this distinction in code. A verdict that cannot name its oracle is rejected. The oracles that ship with the current release cover eight bug classes: reflection; open redirect; insecure direct object reference and broken object level authorization (IDOR/BOLA); error disclosure; MCP exposure; SQL injection, in both boolean-based and blind forms; out-of-band server-side request forgery (SSRF); and out-of-band XML external entity (XXE) injection. The two out-of-band classes are confirmed through a self-hosted out-of-band application security testing (OAST) collaborator.

Prove-or-kill gating extends this principle to third-party scanner output. When nuclei, nikto, ZAP, or any other external scanner produces a finding, pentest-ai holds that finding back from the report until an oracle re-proves it. The scanner output is used as a signal that something might be worth investigating, not as a finding in its own right.
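
The gate can be pictured with a small sketch. This is an illustration under stated assumptions, not pentest-ai's actual code: the names Candidate, Oracle, Verified, and prove_or_kill are hypothetical, and the toy oracle only inspects the URL where a real one would re-run the exploit.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Candidate:
    source: str      # where the suspicion came from, e.g. "zap" or "nikto"
    bug_class: str   # e.g. "open_redirect"
    url: str


@dataclass(frozen=True)
class Oracle:
    name: str                            # every verdict must carry this name
    check: Callable[[Candidate], bool]   # True only if the exploit reproduces


@dataclass(frozen=True)
class Verified:
    candidate: Candidate
    oracle: str


def prove_or_kill(candidates: List[Candidate],
                  oracles: Dict[str, Oracle]) -> List[Verified]:
    """Keep only candidates that a named oracle reproduces."""
    report = []
    for cand in candidates:
        oracle = oracles.get(cand.bug_class)
        if oracle is None:
            continue  # no oracle for this class: it never reaches the report
        if oracle.check(cand):
            report.append(Verified(cand, oracle.name))
    return report


# Toy oracle (hypothetical): "reproduces" only for an off-site redirect target.
redirect_oracle = Oracle(
    name="open-redirect-oracle",
    check=lambda c: "next=//evil.example" in c.url,
)

scanner_output = [
    Candidate("zap", "open_redirect", "https://app.example/login?next=//evil.example"),
    Candidate("zap", "open_redirect", "https://app.example/login?next=/home"),
    Candidate("nikto", "missing_header", "https://app.example/"),
]

report = prove_or_kill(scanner_output, {"open_redirect": redirect_oracle})
for item in report:
    print(item.oracle, item.candidate.url)
```

Of the three scanner signals, only the one the oracle reproduces is reported, and it carries the oracle's name with it.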

Portable proof capsules make the verification reproducible. Every VERIFIED finding ships with a capsule that encodes the exact conditions under which the exploit was confirmed. Running ptai replay against any capsule replays the exploit live and shows the verdict flip from candidate to VERIFIED on screen. The proof is not a screenshot in a report. It is a machine-executable demonstration.

The demo makes this concrete:

pip install ptai && ptai demo

This single command scans a bundled vulnerable application, reports four findings including three oracle-verified ones, replays one finding live from its proof capsule showing three successful replays out of three, and then runs the same scan against a hardened version of the same application and reports zero findings. The only thing that changed between the two runs is the fix. The findings appear and disappear with the vulnerability, not because the tool went quiet.
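
The replay step can be sketched as follows. This is a minimal illustration, not the real capsule format: the Capsule fields, the replay function, and the two stand-in applications are all hypothetical, and the "applications" are plain functions rather than live HTTP targets.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Capsule:
    finding_id: str
    method: str
    url: str
    payload: str
    expect: str   # marker the oracle must see in the response body


Send = Callable[[str, str, str], str]   # (method, url, payload) -> response body


def replay(capsule: Capsule, send: Send, attempts: int = 3) -> Tuple[int, str]:
    """Re-run the exploit and return (successes, verdict)."""
    successes = sum(
        1 for _ in range(attempts)
        if capsule.expect in send(capsule.method, capsule.url, capsule.payload)
    )
    verdict = "VERIFIED" if successes == attempts else "candidate"
    return successes, verdict


def vulnerable_app(method: str, url: str, payload: str) -> str:
    return f"<p>Results for {payload}</p>"   # reflects input unescaped


def hardened_app(method: str, url: str, payload: str) -> str:
    escaped = payload.replace("<", "&lt;").replace(">", "&gt;")
    return f"<p>Results for {escaped}</p>"


capsule = Capsule("F-001", "GET", "http://localhost:8000/search",
                  "<script>x</script>", "<script>x</script>")

# A capsule is plain data, so it survives a round trip through JSON.
restored = Capsule(**json.loads(json.dumps(asdict(capsule))))

print(replay(restored, vulnerable_app))
print(replay(restored, hardened_app))
```

Against the vulnerable stand-in the capsule replays three times out of three and the verdict is VERIFIED. Against the hardened one it never reproduces, so the finding stays a candidate.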

Installation and the Three Integration Paths

pentest-ai installs from PyPI:

pip install ptai

Three distinct integration paths accommodate different deployment contexts, and which path applies determines whether an API key is needed.

The first and recommended path is through Claude Code via MCP. Users who already pay for a Claude Pro, Max, or Team subscription can wire pentest-ai in as an MCP server without any additional API key:

claude mcp add pentest-ai -- ptai mcp

After restarting Claude Code, running a pentest is a natural language instruction to the assistant. The MCP server exposes forty-nine tools covering the full engagement lifecycle: listing and running probes, wrapping external security tools, managing engagement state, executing HTTP requests under a hard scope guard, and retrieving findings and attack chains. The LLM calls these tools through the same mechanism it uses for any other tool-calling task. The security work happens locally against the target. Prompts and tool output go through the Anthropic API in the same way any Claude Code session does.

The second path works for any other MCP-compatible client: Cursor, VS Code Copilot, Codex, Claude Desktop, and others. A setup command auto-detects every MCP-compatible client installed on the machine and writes the appropriate configuration files:

ptai setup --mcp

After restarting the client, the same forty-nine tools are available through whatever interface that client provides.

The third path covers contexts where no MCP client is available: CI/CD pipelines, scheduled jobs, air-gapped terminals, and headless automation. Here the standalone CLI drives its own LLM, and an API key is required. The key can come from Anthropic, OpenAI, a local Ollama instance, or any of more than three hundred models available through the LiteLLM provider integration:

export ANTHROPIC_API_KEY=sk-ant-...
ptai start https://your-target.com

For fully local operation with no cloud dependency:

export PENTEST_AI_LLM_PROVIDER=ollama
ptai start https://your-target.com

The standalone CLI path includes a spending cap, defaulting to ten dollars per engagement, to prevent runaway loop costs. The cap is configurable through an environment variable, and engagement state is preserved if the cap fires, so the engagement can be resumed after raising the limit:

export PTAI_PRICE_LIMIT=25
ptai resume <engagement_id>
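
The pause-and-resume behavior can be sketched in a few lines. Only the PTAI_PRICE_LIMIT variable and the ten-dollar default come from the text above. The Engagement class, the step names, and the per-step costs are hypothetical, and the real tool's accounting is assumed to be more detailed.

```python
import os
from dataclasses import dataclass, field
from typing import List, Mapping


class BudgetExceeded(Exception):
    """Raised before a step that would push spend past the cap."""


def price_limit(env: Mapping[str, str] = os.environ, default: float = 10.0) -> float:
    # Read the cap from the environment, falling back to the default.
    return float(env.get("PTAI_PRICE_LIMIT", default))


@dataclass
class Engagement:
    engagement_id: str
    limit_usd: float
    spent_usd: float = 0.0
    done: List[str] = field(default_factory=list)

    def run_step(self, name: str, cost_usd: float) -> None:
        # Check first, so completed work is never lost when the cap fires.
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(f"{name} would exceed ${self.limit_usd:.2f}")
        self.spent_usd += cost_usd
        self.done.append(name)


eng = Engagement("demo-1", limit_usd=price_limit({}))
try:
    for step, cost in [("recon", 4.0), ("web", 5.0), ("chain", 3.0)]:
        eng.run_step(step, cost)
except BudgetExceeded as exc:
    print("paused:", exc, "| completed so far:", eng.done)

# Raising the limit lets the same engagement continue from where it stopped.
eng.limit_usd = price_limit({"PTAI_PRICE_LIMIT": "25"})
eng.run_step("chain", 3.0)
print(eng.done, eng.spent_usd)
```

The check runs before the spend is recorded, which is what keeps the engagement state intact when the cap fires.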

What the Engagement Actually Looks Like

An engagement in pentest-ai follows a structured phase pipeline that moves from reconnaissance through reporting. The pipeline runs the same way whether the LLM is present or not; the LLM reasons about results and coordinates phase transitions, but the actual detection work is done by deterministic probes and tool wrappers.

Phase one is reconnaissance: port scanning, DNS and subdomain enumeration, and service fingerprinting. Phase two covers authenticated web testing following the OWASP Testing Guide v4, API surface analysis covering OpenAPI, GraphQL, and REST endpoints against the OWASP API Top 10, and browser automation driven by Playwright for DOM analysis, XHR capture, and security header grading. Phase three addresses Active Directory enumeration, Kerberoasting, BloodHound pathfinding, and delegation abuse checks. Phase four covers cloud infrastructure: AWS, Azure, and GCP IAM analysis, misconfiguration checks, Kubernetes RBAC review, and serverless surface analysis. Phase five includes credential testing and local and lateral privilege escalation analysis. Phase six chains findings into multi-step attack paths. Phase seven runs non-destructive proof-of-concept validation for each finding. Phase eight generates detection rules for the blue team in Sigma, Splunk Processing Language, and Kusto Query Language formats. Phase nine produces the final report.

The key property of this pipeline is that authentication is maintained throughout. Most scanners cannot hold a session. They send a request, get a response, and move on. pentest-ai logs in at the authentication phase, maintains the session, refreshes credentials when they expire, and every downstream tool and probe inherits the authenticated session cookie. This is what enables finding vulnerabilities in authenticated areas of an application, which is where most of the interesting bugs actually live.
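
The session handling described above can be sketched as a wrapper that logs in once, attaches the cookie to every request, and logs in again when the target answers 401. This is an illustrative sketch: AuthSession and FakeApp are hypothetical names, and the fake application stands in for a real HTTP target.

```python
from typing import Callable, Dict, Optional, Tuple

Response = Tuple[int, str]                               # (status code, body)
Transport = Callable[[str, Dict[str, str]], Response]


class AuthSession:
    """Holds one authenticated session that every probe shares."""

    def __init__(self, login: Callable[[], str], transport: Transport) -> None:
        self._login = login
        self._transport = transport
        self._cookie = login()
        self.logins = 1

    def get(self, path: str) -> Response:
        status, body = self._transport(path, {"Cookie": self._cookie})
        if status == 401:   # session expired: log in again and retry once
            self._cookie = self._login()
            self.logins += 1
            status, body = self._transport(path, {"Cookie": self._cookie})
        return status, body


class FakeApp:
    """Stand-in target that accepts only its most recent session cookie."""

    def __init__(self) -> None:
        self.valid: Optional[str] = None
        self.counter = 0

    def login(self) -> str:
        self.counter += 1
        self.valid = f"session={self.counter}"
        return self.valid

    def handle(self, path: str, headers: Dict[str, str]) -> Response:
        if headers.get("Cookie") != self.valid:
            return 401, "login required"
        return 200, f"private data for {path}"

    def expire(self) -> None:
        self.valid = None


app = FakeApp()
session = AuthSession(app.login, app.handle)
print(session.get("/account"))
app.expire()                      # simulate the server dropping the session
print(session.get("/orders"))     # transparently re-authenticates
```

Because every probe goes through the same session object, none of them needs to know that the session was renewed.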

The Probe and Tool Library

pentest-ai wraps more than two hundred external security tools through a unified interface. The installation model for these tools is practical: at engagement start, the planner predicts which tools the engagement will need and requests installation of any that are missing in a single prompt. The answer is persisted so subsequent engagements do not repeat the question. Alternatively, tools can be installed in tiers before the engagement begins:

ptai setup --tier core         # approximately six essentials, about thirty seconds
ptai setup --tier recommended  # adds fuzzers, crawlers, and password tools
ptai setup --tier full         # everything, approximately thirty minutes

The tool library includes nmap, masscan, nuclei, ffuf, sqlmap, gobuster, wapiti, nikto, dalfox, xsstrike, wpscan, hydra, hashcat, enum4linux, bloodhound-python, the impacket suite, trufflehog, gitleaks, kube-hunter, trivy, prowler, scout-suite, and more than one hundred and eighty others. More than four thousand Nuclei templates are integrated for atomic vulnerability detection alongside the sixty curated web probes that form the core of the SPA testing capability.

The sixty curated probes cover the OWASP Top 10 and OWASP API Top 10 bug classes with SPA-aware detection logic. Standard crawl-based scanners struggle with modern single-page applications because crawling cannot account for client-side routing, dynamic content loading, and authentication-dependent rendering. The curated probes are designed for this environment: they target vulnerability classes at the HTTP and application layer rather than relying on crawl coverage.

Out-of-band detection covers blind vulnerability classes that cannot be detected through in-band response analysis. Blind server-side request forgery, blind SQL injection, blind XML external entity injection, blind stored cross-site scripting, server-side template injection, and Log4Shell all require the target application to make an outbound callback to an attacker-controlled server before the vulnerability can be confirmed. pentest-ai integrates with ProjectDiscovery's oast.fun infrastructure by default, with each engagement generating a fresh RSA-2048 keypair so that interaction contents are encrypted to the local process and unreadable to the collaborator infrastructure. For paid engagements or programs that require callback infrastructure on tester-controlled hosts, pointing pentest-ai at a self-hosted Interactsh server is straightforward:

ptai start http://target --oast-server https://oast.example.com --oast-token <token>
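
Out-of-band confirmation depends on tying each callback to the probe that caused it. The sketch below shows only that correlation step, with a unique subdomain per probe. CallbackTracker is a hypothetical name, and the sketch leaves out the encryption layer and the Interactsh protocol itself.

```python
import secrets
from typing import Dict, Optional


class CallbackTracker:
    """Maps unique callback hostnames back to the probe that planted them."""

    def __init__(self, domain: str) -> None:
        self.domain = domain.lower()
        self._pending: Dict[str, str] = {}   # token -> probe description

    def new_host(self, probe: str) -> str:
        token = secrets.token_hex(8)         # unguessable, one per probe
        self._pending[token] = probe
        return f"{token}.{self.domain}"

    def match(self, callback_host: str) -> Optional[str]:
        # DNS names are case-insensitive and may carry a trailing dot.
        host = callback_host.lower().rstrip(".")
        suffix = "." + self.domain
        if not host.endswith(suffix):
            return None
        return self._pending.get(host[: -len(suffix)])


tracker = CallbackTracker("oast.example.com")
host = tracker.new_host("blind SSRF via ?url=")
payload = f"http://{host}/"                  # value planted in the request

# Later, the collaborator reports a DNS or HTTP interaction for some hostname.
print(tracker.match(host))
print(tracker.match("deadbeef.oast.example.com"))
```

A callback for an unknown token matches nothing, so stray traffic to the collaborator domain cannot confirm a finding.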

CI Integration and Report Formats

For AppSec teams embedding security testing in their development pipeline, pentest-ai integrates with GitHub Actions through a straightforward workflow configuration:

name: Security scan
on: [pull_request]
jobs:
  ptai:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install ptai
      - run: |
          ptai start ${{ vars.STAGING_URL }} \
            --ci \
            --fail-on high \
            --sarif pentest.sarif
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: pentest.sarif

This configuration runs an authenticated scan against the staging environment on every pull request. Findings post as a PR comment. The SARIF output uploads to GitHub Code Scanning. The build fails on any high-severity finding that passes the oracle gate. Only verified findings can fail the build, which means the severity gate is trustworthy rather than being a source of false-positive build failures that train developers to ignore security gates.
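
The build gate reduces to a small decision. The sketch below assumes a five-level severity scale and a dictionary shape for findings. Both are illustrative choices, not the tool's real data model.

```python
from typing import Dict, List

SEVERITY_ORDER = ["info", "low", "medium", "high", "critical"]


def exit_code(findings: List[Dict], fail_on: str = "high") -> int:
    """Return 1 when a verified finding meets or exceeds the threshold."""
    threshold = SEVERITY_ORDER.index(fail_on)
    for finding in findings:
        severe = SEVERITY_ORDER.index(finding["severity"]) >= threshold
        if finding["verified"] and severe:
            return 1
    return 0


findings = [
    {"title": "Possible SQLi", "severity": "critical", "verified": False},
    {"title": "Verbose error page", "severity": "medium", "verified": True},
]
print(exit_code(findings))   # unverified critical does not fail the build

findings.append({"title": "IDOR on /orders", "severity": "high", "verified": True})
print(exit_code(findings))   # verified high does
```

An unverified critical finding leaves the build green, while a verified high finding turns it red.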

GitLab CI and Jenkins pipeline templates are available in the documentation alongside advanced options for authentication profiles in CI, cost gates, and scope file configuration.

Six output formats are supported: Markdown for human-readable reports, HTML for browser viewing, PDF for client deliverables, SARIF 2.1.0 for tool integration with code scanning platforms, JUnit XML for test framework integration, and compliance mappings that tag findings against OWASP, CWE, CVE, and CVSS v3.1.
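
For readers unfamiliar with SARIF, a minimal 2.1.0 document has a version, a list of runs, and per-run results that point back to rules. The sketch below builds one from a list of findings. The finding fields, the tool name, and the severity-to-level mapping are assumptions made for illustration, and pentest-ai's actual SARIF output will contain more detail.

```python
import json
from typing import Dict, List

# Assumed mapping from scanner severities to the SARIF "level" values.
LEVELS = {"critical": "error", "high": "error", "medium": "warning",
          "low": "note", "info": "note"}


def to_sarif(findings: List[Dict], tool_name: str = "example-scanner") -> Dict:
    rule_ids = sorted({f["rule_id"] for f in findings})
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [{
            "tool": {"driver": {"name": tool_name,
                                "rules": [{"id": r} for r in rule_ids]}},
            "results": [{
                "ruleId": f["rule_id"],
                "level": LEVELS[f["severity"]],
                "message": {"text": f["title"]},
                "locations": [{"physicalLocation": {
                    "artifactLocation": {"uri": f["url"]}}}],
            } for f in findings],
        }],
    }


doc = to_sarif([{
    "rule_id": "CWE-639",
    "severity": "high",
    "title": "IDOR on /api/orders/{id}",
    "url": "https://staging.example.com/api/orders/2",
}])
print(json.dumps(doc, indent=2))
```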

Playbooks: Repeatable Methodology as Code

For red teams and consultants who run the same methodology repeatedly across different engagements, pentest-ai supports YAML playbooks that encode an engagement methodology as a version-controlled file:

name: internal-ad-pentest
inputs:
  domain: { required: true, prompt: "AD domain" }
  dc_ip:  { required: true, prompt: "DC IP" }
phases:
  - id: recon
    tools: [nmap, masscan]
  - id: ad-enum
    depends_on: [recon]
    condition: "any_finding(type='open_port', port=445)"
    tools: [enum4linux, ldapsearch, bloodhound-python]
  - id: kerberoast
    requires_finding: { type: ad_user_enumerated }
    tools: [impacket-getuserspns]
    llm_decide: true

The playbook format supports dependency declarations between phases, conditional execution based on findings from earlier phases, required finding prerequisites that gate downstream phases, and an llm_decide flag that delegates the execution decision for a phase to the LLM based on accumulated context. Five built-in playbooks ship with the tool covering common engagement types. Custom playbooks can be stored in any directory and executed directly:

ptai playbook run ./my-ad.yaml

The ability to version-control methodology and share it across a team means that every engagement runs the same process. Findings are comparable across engagements because the methodology was consistent. New team members run the same probes and chains that experienced practitioners have refined over many engagements.
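
The dependency and gating fields in a playbook can be read as a small scheduling problem. The sketch below is one way to picture it, not the tool's implementation: it orders phases by depends_on, then skips any phase whose condition or requires_finding is unmet. Conditions are simplified to attribute dictionaries rather than expression strings, findings are treated as a fixed list, and runnable_phases is a hypothetical name.

```python
from graphlib import TopologicalSorter   # standard library, Python 3.9+
from typing import Dict, List


def any_finding(findings: List[dict], **attrs) -> bool:
    """True if some finding matches every given attribute."""
    return any(all(f.get(k) == v for k, v in attrs.items()) for f in findings)


def runnable_phases(phases: List[dict], findings: List[dict]) -> List[str]:
    by_id: Dict[str, dict] = {p["id"]: p for p in phases}
    graph = {pid: set(p.get("depends_on", [])) for pid, p in by_id.items()}
    selected = []
    for pid in TopologicalSorter(graph).static_order():   # dependencies first
        phase = by_id[pid]
        condition = phase.get("condition")          # attribute dict in this sketch
        required = phase.get("requires_finding")
        if condition and not any_finding(findings, **condition):
            continue
        if required and not any_finding(findings, **required):
            continue
        selected.append(pid)
    return selected


phases = [
    {"id": "recon"},
    {"id": "ad-enum", "depends_on": ["recon"],
     "condition": {"type": "open_port", "port": 445}},
    {"id": "kerberoast", "requires_finding": {"type": "ad_user_enumerated"}},
]

print(runnable_phases(phases, [{"type": "open_port", "port": 445}]))
```

With only an open port 445 on record, the recon and ad-enum phases are selected and kerberoast is held back until a finding of type ad_user_enumerated exists.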

Scope Controls and Responsible Use

pentest-ai is offensive security tooling. It executes real network and host operations against the targets it is pointed at. The documentation is explicit: by installing or running the tool, users agree to the Acceptable Use Policy and Terms of Service, and testing systems without explicit written authorization may violate the Computer Fraud and Abuse Act, the Computer Misuse Act 1990, GDPR Article 32, and equivalent legislation in other jurisdictions.

The built-in scope controls exist to make authorized tests stay within their authorized boundaries. The strict scope setting refuses all off-host requests and stops following redirects to out-of-scope hosts. The safe intensity setting skips state-mutating probes that could cause data loss or service disruption. Rate limiting respects 429 responses and Retry-After headers. Non-destructive proof-of-concept validation confirms vulnerability existence without exploiting it destructively.
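
Two of these controls are easy to sketch: the host check applied to every request and redirect, and the back-off applied after a 429. The function names below are hypothetical. The sketch handles only the seconds form of Retry-After, not the HTTP-date form, and expects the header under its canonical capitalization.

```python
from typing import Mapping, Set
from urllib.parse import urljoin, urlsplit


def in_scope(url: str, allowed_hosts: Set[str]) -> bool:
    """Allow a request only when its host is explicitly authorized."""
    host = urlsplit(url).hostname          # already lower-cased by urlsplit
    return host is not None and host in allowed_hosts


def may_follow(base_url: str, location: str, allowed_hosts: Set[str]) -> bool:
    """Resolve a redirect target against the current URL, then scope-check it."""
    return in_scope(urljoin(base_url, location), allowed_hosts)


def retry_delay(status: int, headers: Mapping[str, str], default: float = 1.0) -> float:
    """Seconds to wait before the next request."""
    if status != 429:
        return 0.0
    value = headers.get("Retry-After", "")
    return float(value) if value.isdigit() else default


scope = {"target.example"}
base = "https://target.example/login"
print(may_follow(base, "/home", scope))                    # relative: same host
print(may_follow(base, "//evil.example/phish", scope))     # scheme-relative: off-host
print(retry_delay(429, {"Retry-After": "120"}))
```

Resolving the Location value before checking it matters: a scheme-relative redirect such as //evil.example/phish looks like a path but points off-host.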

The first run of the tool prompts for explicit acceptance of the Acceptable Use Policy and persists the choice locally. For CI environments where interactive prompts are not appropriate, the acceptance can be set through an environment variable.

Human-in-the-loop teleoperation allows an operator to take over an engagement mid-run through a keyboard shortcut. This reflects a design choice: pentest-ai is built for the human-assisted regime rather than the fully autonomous one. Research on fully autonomous LLM pentest agents shows completion rates of twenty-one to thirty-one percent for end-to-end pentest tasks. Human-assisted approaches reach sixty-four percent. pentest-ai is built for that second number: the LLM coordinates and reasons, the curated probes detect, and the operator makes the calls that require judgment.

The Seventeen Agents Under the Hood

The engagement pipeline is implemented as seventeen specialized agents that each own a specific phase of the security testing workflow.

The recon agent handles reconnaissance: port scanning, DNS enumeration, subdomain discovery, and service fingerprinting. The web agent runs the authenticated OWASP Testing Guide v4 pass against the target. The API security agent handles OpenAPI, GraphQL, and REST surface analysis against the OWASP API Top 10. The browser agent drives Playwright for DOM analysis, XHR capture, and security header grading. The AD agent covers Active Directory enumeration, Kerberoasting, BloodHound pathfinding, and delegation abuse detection. The cloud agent handles AWS, Azure, and GCP IAM analysis, misconfiguration checks, Kubernetes RBAC review, and serverless surface analysis.

The credential tester agent handles password spraying, credential stuffing, and MFA bypass checks. The privilege escalation agent provides local and lateral privilege escalation analysis from collected context. The vulnerability scanner agent runs cross-cutting vulnerability aggregation against the findings database. The exploit chain agent correlates findings into multi-step attack paths. The proof-of-concept validator agent runs non-destructive proof-of-concept confirmation for each finding. The detection agent generates Sigma, SPL, and KQL rules for the blue team. The report agent produces output in all six supported formats.

Four optional agents extend the scope beyond web application testing: the LLM red team agent runs OWASP LLM Top 10 probes against chatbot and AI application endpoints, the social engineering agent generates a phishing corpus and pretext material for authorized social engineering exercises, the mobile agent handles Android and iOS static and dynamic analysis, and the wireless agent covers wireless reconnaissance and handshake capture.

Each agent runs with the LLM when a key is configured, or as a deterministic tool loop when no key is available. The phase order and the probe behavior are identical in both cases. The LLM adds reasoning about results, prioritization of follow-up probes, and synthesis of findings into coherent attack chains. The core detection capability does not depend on it.
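
One way to picture this split is a phase runner in which the probes do all the detecting and an optional planner may only reorder them. The sketch is hypothetical throughout: run_phase, the probe functions, and the planner are illustrative stand-ins, not the tool's agent code.

```python
from typing import Callable, Dict, List, Optional

Probe = Callable[[str], List[str]]            # target -> findings
Planner = Callable[[List[str]], List[str]]    # may reorder probe names only


def run_phase(probes: Dict[str, Probe], target: str,
              planner: Optional[Planner] = None) -> List[str]:
    names = list(probes)
    if planner is not None:
        proposed = planner(names)
        # Accept the proposal only if it is a permutation of the same probes.
        if sorted(proposed) == sorted(names):
            names = proposed
    findings: List[str] = []
    for name in names:
        findings.extend(probes[name](target))   # detection is deterministic
    return findings


probes = {
    "headers": lambda t: [f"missing CSP on {t}"],
    "redirect": lambda t: [],
    "idor": lambda t: [f"IDOR on {t}/api/orders"],
}

without_llm = run_phase(probes, "https://app.example")
with_llm = run_phase(probes, "https://app.example",
                     planner=lambda names: list(reversed(names)))
print(sorted(without_llm) == sorted(with_llm))
```

The planner changes the order in which probes run, but the set of findings is the same with or without it.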

Conclusion

pentest-ai represents a substantively different approach to automated security testing. The market for security scanning tools has long accepted a trade-off between coverage and precision: high-coverage scanners generate noise that trains teams to ignore findings, and low-noise scanners produce minimal output that misses real vulnerabilities. The oracle gate changes the terms of that trade-off. By re-running every exploit to confirm it before reporting, pentest-ai achieves both high recall and a zero percent false-positive rate, verified on both a private honeypot benchmark and a published comparison against established tools.

The MCP integration brings this capability into the AI-assisted development workflows that are becoming standard practice. A developer using Claude Code or Cursor who adds pentest-ai as an MCP server gains access to forty-nine security testing tools callable through natural language, driven by a curated probe library that finds real vulnerabilities rather than generating candidate findings that require human triage.

The responsible use framework, the scope controls, the non-destructive proof-of-concept validation, the spending caps, and the human-in-the-loop teleoperation are not afterthoughts. They reflect the principle that useful offensive security tooling must be trustworthy and bounded, not just capable.

The repository is available at: https://github.com/0xSteph/pentest-ai
