3,681 lines of Python, 23 agent tools, and the question every security engineer is quietly asking: can an LLM actually hack?
It started with frustration.
I was halfway through a network pentest engagement — tabbing between six terminal windows, copy-pasting nmap output into notes, trying to remember which credentials I'd already found for which services, and manually tracking which phase of the methodology I was in. The cognitive overhead wasn't the hacking. It was the bookkeeping.
So I asked myself a question that kept me up for weeks: what if the bookkeeping, the methodology tracking, the tool orchestration — what if all of that was handled by an AI agent, while I focused on the parts that actually require human judgment?
The result is Autonomous Penetration Testing Copilot — an open-source, single-file Python agent that connects to your Kali or Parrot attack box, runs 60+ security tools autonomously, stores credentials for cross-service reuse, spawns parallel sub-agents, and documents every finding. All driven by an LLM agentic loop.
This is the story of building it, the architectural decisions that mattered, and the hard lessons about what AI can and cannot do in offensive security.
The Problem with Manual Pentesting
Let me paint a picture that any pentester will recognize.
You run an nmap scan and discover that the target has SSH on port 22, Apache on port 80, and MySQL on port 3306. You note down the service versions. You run searchsploit against Apache 2.4.41. You fire up ffuf for directory brute-forcing. While that's running, you decide to test the login form for SQL injection with sqlmap. Sqlmap finds an injection and dumps credentials. Now you need to try those credentials against SSH and MySQL. But wait — you found different credentials on a different endpoint twenty minutes ago. Where did you write those down?
This isn't a skill problem. It's a state management problem. Pentesters are extraordinarily skilled humans doing extraordinarily tedious bookkeeping between moments of creative exploitation.
What if an agent could carry the state?
The Architecture: One File, Zero Excuses
The first decision was controversial, and I'd make it again: everything lives in a single Python file. All 3,681 lines. No package structure, no setup.py, no dependency tree that breaks between machines.
Here's why. Security tools run in hostile, ephemeral environments. You're SSH'd into a jump box, or you've spun up a fresh Kali VM for an engagement. The last thing you want is to troubleshoot a broken pip install chain. With one file and two dependencies (anthropic + paramiko), you can go from zero to running in under a minute:
pip install anthropic paramiko
python pentest_copilot.py --target 10.0.0.1 --local

The architecture inside that single file, however, is anything but simple.
Two Execution Engines
The agent supports two modes of command execution. SSHExecutor uses Paramiko to connect to a remote attack box — your Kali machine sitting on the engagement network. LocalExecutor uses Python's subprocess for when you're running directly on the attack machine. Same interface, same agent logic, different transport.
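A minimal sketch of that shared interface, assuming a simple `run(command) -> output` contract (class and method names here are illustrative, not the project's actual code):

```python
import subprocess


class LocalExecutor:
    """Runs commands directly on this machine via subprocess."""

    def run(self, command: str, timeout: int = 300) -> str:
        result = subprocess.run(
            command, shell=True, capture_output=True,
            text=True, timeout=timeout,
        )
        return result.stdout + result.stderr


class SSHExecutor:
    """Runs commands on a remote attack box over Paramiko SSH."""

    def __init__(self, host: str, user: str, key_path: str):
        import paramiko  # the project's second dependency
        self.client = paramiko.SSHClient()
        self.client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        self.client.connect(host, username=user, key_filename=key_path)

    def run(self, command: str, timeout: int = 300) -> str:
        _, stdout, stderr = self.client.exec_command(command, timeout=timeout)
        return stdout.read().decode() + stderr.read().decode()
```

Because both classes expose the same `run` signature, the agent logic never needs to know which transport it is using.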
The LLM Provider Abstraction
I built a clean provider abstraction that supports both Claude (via Anthropic's API) and any OpenAI-compatible endpoint. This means you can run the agent with Claude Sonnet for complex reasoning, GPT-4o for speed, or even a local Llama model through Ollama. The agent doesn't care — it just needs function-calling support.
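Under stated assumptions (illustrative class names, the official `anthropic` and `openai` SDKs), the provider abstraction can be sketched like this:

```python
from abc import ABC, abstractmethod


class LLMProvider(ABC):
    """Common interface: the agent only needs chat plus tool calling."""

    @abstractmethod
    def chat(self, messages: list, tools: list):
        ...


class AnthropicProvider(LLMProvider):
    def __init__(self, model: str):
        import anthropic
        self.client = anthropic.Anthropic()
        self.model = model

    def chat(self, messages, tools):
        return self.client.messages.create(
            model=self.model, max_tokens=4096,
            messages=messages, tools=tools,
        )


class OpenAICompatProvider(LLMProvider):
    """Works with OpenAI, Ollama, vLLM -- anything exposing /v1/chat/completions."""

    def __init__(self, model: str, base_url: str):
        import openai
        self.client = openai.OpenAI(base_url=base_url)
        self.model = model

    def chat(self, messages, tools):
        return self.client.chat.completions.create(
            model=self.model, messages=messages, tools=tools,
        )
```

The two SDKs use slightly different tool-call schemas, so a real implementation would also normalize responses into one internal shape before they reach the agentic loop.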
python pentest_copilot.py --target 10.0.0.1 --local \
--provider openai --model llama3 \
--base-url http://localhost:11434/v1

23 Agent Tools
This is where it gets interesting. The agent isn't a chatbot. It's a tool-using agent with 23 registered tools organized into four tiers:
Core tools handle the basics: executing commands, reading and writing files, running Python scripts, installing new tools, and reporting findings.
Parallelism and state tools are where the architecture diverges from simpler agent implementations. The agent can spawn background sub-agents that run concurrent tasks — say, directory brute-forcing and subdomain enumeration at the same time. It maintains a thread-safe credential vault that deduplicates and tracks discovered passwords, hashes, tokens, and API keys across the entire engagement. It can open named persistent shell sessions for context separation.
Detection and exploitation tools include a wrapper around searchsploit for CVE lookup, built-in netcat listener management, and seven reverse shell payload generators covering bash, Python, netcat, PHP, Perl, and PowerShell.
Methodology and stealth tools handle phase tracking across the five standard pentest phases, compliance mapping to OWASP Top 10, PTES, NIST 800-53, and CWE, and a stealth mode that rate-limits commands and injects IDS evasion flags.
The Agentic Loop: How It Actually Thinks
The core of the agent is a loop that will be familiar to anyone who's built with LLM function calling, but with some nuances specific to pentesting.
On each user turn, the agent builds a dynamic system prompt. This isn't a static instruction block — it's injected with live state: the current credential vault contents, active shell sessions, sub-agent status, listener status, tool detection results, and progress through the methodology phases. The LLM sees everything the agent knows, every time it makes a decision.
The loop runs up to 25 iterations per user turn. The LLM responds with either text (displayed to the user) or a tool call (executed and fed back in). Between iterations, results from completed sub-agents are automatically injected into the conversation context.
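The shape of that loop can be sketched as follows (the `llm_chat` callable, tool registry, and message format are simplified illustrations, not the project's actual interfaces):

```python
MAX_ITERATIONS = 25  # per user turn, matching the agent's cap


def run_turn(llm_chat, tool_registry, messages, build_system_prompt):
    """One user turn of the agentic loop (illustrative sketch).

    llm_chat(system, messages) returns either
    {"text": ...} or {"tool": name, "input": {...}}.
    """
    for _ in range(MAX_ITERATIONS):
        # The system prompt is rebuilt on every pass so the model always
        # sees live state: vault contents, sessions, sub-agent status.
        response = llm_chat(build_system_prompt(), messages)
        if "tool" not in response:
            return response["text"]  # plain text: surface it to the user
        result = tool_registry[response["tool"]](**response["input"])
        messages.append({"role": "assistant", "tool": response["tool"]})
        messages.append({"role": "tool", "content": result})
    return "[max iterations reached]"
```

The key property is that tool results are appended to the conversation before the next pass, so every decision is made with the full history of what has already been tried.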
Here's the critical design decision: the system prompt tells the LLM what credentials it has available. This means when the agent finds SQL credentials on a web app, it naturally tries those same credentials against SSH and other services without being told to. Credential reuse is the number one way real attackers move laterally — and the agent mirrors that behavior because it can see its own vault.
Five Methodology Playbooks
Raw autonomy without structure is just chaos. That's why the agent ships with five methodology playbooks: web application, network, API, Active Directory, and cloud security.
Each playbook is essentially a structured prompt that guides the LLM through the standard phases of that engagement type. The web application playbook, for example, walks through reconnaissance (nmap, whatweb, wafw00f), content discovery (ffuf, robots.txt), vulnerability scanning (nikto, nuclei, sqlmap, dalfox), exploitation, and reporting.
But here's the thing — the playbooks aren't rigid scripts. They're suggestions to an agent that can reason. If the LLM discovers something unexpected during reconnaissance, it adapts. If it finds a service that the playbook doesn't cover, it improvises. The playbooks provide structure; the LLM provides intelligence.
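One plausible way to represent such a playbook is as structured data rendered into the system prompt; the phase names and tool lists below are taken from the article, but the data layout and function are assumptions for illustration:

```python
# Illustrative playbook structure, not the project's actual data layout.
WEB_APP_PLAYBOOK = {
    "name": "Web Application",
    "phases": [
        ("reconnaissance", ["nmap", "whatweb", "wafw00f"]),
        ("content discovery", ["ffuf", "robots.txt review"]),
        ("vulnerability scanning", ["nikto", "nuclei", "sqlmap", "dalfox"]),
        ("exploitation", []),
        ("reporting", []),
    ],
}


def playbook_prompt(playbook: dict) -> str:
    """Render a playbook into a guidance block for the system prompt."""
    lines = [f"Methodology: {playbook['name']}. Treat phases as guidance,"
             " not a script; adapt when findings warrant it."]
    for i, (phase, tools) in enumerate(playbook["phases"], 1):
        hint = f" (suggested tools: {', '.join(tools)})" if tools else ""
        lines.append(f"{i}. {phase}{hint}")
    return "\n".join(lines)
```

Rendering the playbook as text rather than hard-coding a script is exactly what lets the LLM deviate from it when reconnaissance turns up something unexpected.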
The Credential Vault: Why State Management Changes Everything
If I had to point to one feature that elevates this from "LLM wrapper for terminal commands" to "actual pentest agent," it's the credential vault.
The vault is a thread-safe, deduplicated store for every credential the agent discovers. Passwords, NTLM hashes, JWT tokens, API keys, session cookies — they all go in. Each credential is tagged with where it was found and how, and the vault is surfaced in the system prompt so the LLM always knows what's available.
This creates emergent behavior that surprised even me during testing. The agent finds database credentials through SQL injection on a web app. Later, when it discovers an admin panel on a different port, it automatically tries those same credentials without being explicitly told to. When it enumerates SSH, it tries every credential in the vault. This isn't scripted — it's the natural result of giving an LLM visibility into its own state.
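A thread-safe, deduplicated vault like the one described can be sketched in a few lines; the class and field names are illustrative, but the core idea is that equality (and thus deduplication) ignores where a credential was found:

```python
import threading
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Credential:
    kind: str       # "password", "ntlm", "jwt", "api_key", ...
    username: str
    secret: str
    source: str = field(compare=False)  # provenance; excluded from dedup


class CredentialVault:
    """Thread-safe, deduplicated credential store."""

    def __init__(self):
        self._lock = threading.Lock()
        self._creds = set()

    def add(self, cred: Credential) -> bool:
        """Store a credential; returns True only if it was new."""
        with self._lock:
            if cred in self._creds:
                return False
            self._creds.add(cred)
            return True

    def render(self) -> str:
        """Summary block injected into the system prompt each turn."""
        with self._lock:
            if not self._creds:
                return "Vault: empty"
            return "Vault:\n" + "\n".join(
                f"- [{c.kind}] {c.username}:{c.secret} (from {c.source})"
                for c in sorted(self._creds, key=lambda c: (c.kind, c.username))
            )
```

Surfacing `render()` output in every system prompt is what makes credential reuse emerge without explicit instructions.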
Sub-Agent Parallelism: Because Real Pentesters Multitask
A single-threaded agent is fine for simple tasks, but real pentests involve parallel workstreams. You run a slow directory brute-force in one terminal while poking at authentication in another.
The agent's sub-agent system mirrors this. The main agent can spawn background sub-agents, each with its own LLM conversation context and access to a subset of tools. The sub-agents run in separate threads. When they complete, their results — including any findings and credentials — are automatically merged back into the main agent's context.
[SUBAGENT] Spawned a1b2c3d4: directory brute-force (running in background)

The main agent continues working while the sub-agent does its thing. This is how an experienced pentester operates: multiple workstreams running in parallel, results feeding back into the overall picture.
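A minimal sketch of this pattern, assuming a thread-per-sub-agent model with a poll-and-merge step (the class, method names, and ID format are illustrative):

```python
import threading
import uuid


class SubAgentManager:
    """Spawn background tasks; merge their results back when polled."""

    def __init__(self):
        self._lock = threading.Lock()
        self._results = {}  # agent_id -> result, filled on completion

    def spawn(self, task_fn, description: str) -> str:
        agent_id = uuid.uuid4().hex[:8]

        def worker():
            # In the real agent each sub-agent owns its own LLM
            # conversation context and a subset of the tools.
            result = task_fn()
            with self._lock:
                self._results[agent_id] = result

        threading.Thread(target=worker, daemon=True).start()
        print(f"[SUBAGENT] Spawned {agent_id}: {description} (running in background)")
        return agent_id

    def drain(self) -> dict:
        """Collect finished results for injection into the main context."""
        with self._lock:
            done, self._results = self._results, {}
            return done
```

Between loop iterations, the main agent would call `drain()` and append anything returned to its conversation, which is how sub-agent findings and credentials flow back.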
Stealth Mode: Operating Under the Radar
Here's where security engineering experience matters more than AI engineering. Real pentest engagements often require stealth — you're testing whether the blue team can detect you, or you're operating under rules of engagement that prohibit triggering IDS alerts.
The stealth controller applies rate limiting with configurable delays and random jitter. More importantly, it automatically modifies tool flags for nine different tools. Nmap gets -sS -T2 -f for SYN stealth scans with fragmentation. Sqlmap gets --delay and --random-agent. Ffuf gets -rate 10 to throttle requests.
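The flag-injection side of this is essentially a per-tool lookup table plus pacing; the table below covers three of the nine tools using the flags named above, and the function signature is an illustrative assumption:

```python
import random
import time

# Per-tool stealth flags (three of the nine tools, per the article).
STEALTH_FLAGS = {
    "nmap": "-sS -T2 -f",                 # SYN scan, slow timing, fragmentation
    "sqlmap": "--delay=2 --random-agent", # pace requests, rotate User-Agent
    "ffuf": "-rate 10",                   # throttle to 10 requests/sec
}


def apply_stealth(command: str, base_delay: float = 2.0,
                  jitter: float = 1.0) -> str:
    """Rate-limit with random jitter, then inject evasion flags."""
    time.sleep(base_delay + random.uniform(0, jitter))  # pacing between commands
    parts = command.split(maxsplit=1)
    tool = parts[0]
    if tool in STEALTH_FLAGS:
        rest = parts[1] if len(parts) > 1 else ""
        return f"{tool} {STEALTH_FLAGS[tool]} {rest}".strip()
    return command
```

Injecting the flags right after the tool name keeps the user's own arguments intact while still changing the tool's on-the-wire behavior.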
Toggle it with a single command:
/stealth

The agent adapts its entire toolchain to operate more quietly.
Safety: Because rm -rf / Is Not a Finding
Building a tool that autonomously executes shell commands on machines requires thinking carefully about safety.
The agent includes 13 regex patterns for dangerous commands — fork bombs, rm -rf /, disk writes to /dev/sd*, firewall flushes, and shutdown commands. When it detects one, it pauses and asks for user approval before proceeding. For CI/CD pipelines, there's an --auto-approve flag that bypasses these prompts (with the assumption that the pipeline environment is disposable).
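A few of those categories can be expressed as regexes like the following; these specific patterns are my own illustrative reconstruction, not the agent's actual 13:

```python
import re

# Illustrative dangerous-command patterns (assumptions, not the real list).
DANGEROUS_PATTERNS = [
    r"rm\s+-rf\s+/(\s|$)",          # recursive delete of the filesystem root
    r":\(\)\s*\{.*\};\s*:",         # classic bash fork bomb
    r"\bdd\b.*of=/dev/sd",          # raw writes to a disk device
    r"\biptables\s+(-F|--flush)",   # firewall flush
    r"\b(shutdown|poweroff|halt)\b",
]


def requires_approval(command: str) -> bool:
    """True if the command matches a dangerous pattern and should pause
    for user approval before execution."""
    return any(re.search(p, command) for p in DANGEROUS_PATTERNS)
```

Pattern matching like this is a tripwire, not a sandbox: it catches the obvious footguns, which is why the `--auto-approve` bypass assumes a disposable environment.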
Output is truncated at 15,000 characters to prevent context window overflow, and conversation history is capped at 60 messages with automatic trimming.
What I Learned About AI in Offensive Security
After building four versions of this agent, here's what I've come to believe about AI in pentesting:
AI is excellent at orchestration. Choosing which tool to run next, parsing output, deciding on the next step based on results — this is where LLMs shine. The agent makes the same sequence of decisions an experienced pentester would make, because those decisions are pattern-matching on structured data.
AI is mediocre at novel exploitation. When a vulnerability requires creative thinking — chaining three findings together, crafting a custom payload for a weird edge case, recognizing that an unusual error message indicates a deeper issue — the LLM struggles. It can try standard approaches, but it can't replicate the intuition that comes from years of practice.
State management is the force multiplier. The credential vault, the progress tracker, the evidence collector — these systems transform the agent from "fancy terminal autocomplete" into something that maintains context across an entire engagement. This is what humans are bad at and machines are good at.
The single-file architecture was the right call. Every time I've seen someone struggle to set up a pentesting tool, it was because of dependency hell. One file, two pip packages, done.
What's Next
The agent is open-source under the MIT license. It's currently at version 2.2.0 with 23 agent tools, 20 CLI commands, and support for 60+ security tools.
The roadmap includes deeper integration with the Phalanx Cyber scanner suite (I've built companion repos for SAST, DAST, API security, and AWS security scanning), improved sub-agent coordination, and support for longer engagement sessions with persistent state across multiple runs.
If you're a security engineer, a pentester, or just someone curious about what happens when you give an LLM a Kali box and a target, I'd love for you to try it.
GitHub: github.com/Krishcalin/Autonomous-Pen-Testing
Disclaimer: This tool is designed for authorized security testing and educational purposes only. Always obtain proper authorization before conducting penetration tests. The author does not condone unauthorized access to computer systems.
Tags: #CyberSecurity #PenetrationTesting #AI #MachineLearning #Python #OpenSource #InfoSec #RedTeam #LLM #AgenticAI