When AI Coding Agents Break the Boundary: Lessons from Cursor, Claude and Codex

What can happen when an agent encounters malicious input and incomplete controls and how Claude's agents Code, Codex and Cursor behave

Marco Mastrodonato

~9 min read · April 30, 2026 (Updated: April 30, 2026) · Free: Yes

Introduction

AI coding agents are becoming part of everyday software development. They can read repositories, edit files, run commands, use tools, and sometimes interact with remote environments. This makes them powerful assistants, and it also changes the security model. Untrusted text inside a repository can influence an agent that has real access to a developer's machine.

The Cursor NomShub incident is a useful case study because it shows how prompt injection, shell execution, sandbox design, and remote access features can combine into a serious attack chain. The lesson reaches beyond one product. As coding agents become more autonomous, security depends on where we place boundaries, how strongly those boundaries are enforced, and how much privilege we give the agent by default.

1. The Cursor incident: what actually happened

The case that sparked the discussion is the vulnerability chain called NomShub, disclosed by Straiker. The core idea is simple to describe and serious in its implications: a malicious repository could contain hidden instructions inside seemingly harmless files such as a README.md. When Cursor read that content through its agent, the agent could be influenced to execute actions on the user's system. Straiker describes a chain involving indirect prompt injection, a shell command parser bypass, and abuse of Cursor's remote tunnel feature, ultimately leading to persistent shell access on the developer's machine just by opening the repository. Cursor addressed the issue in version 3.0.

Looking more closely, the first step was an indirect prompt injection. The agent processed repository content that appeared to be documentation or supporting text, yet contained instructions crafted to be interpreted by the model as actionable directives. This matters because the issue is not a classic memory corruption exploit, but a boundary problem between untrusted content and executable instructions handled by an agent with real privileges.

The second step was a sandbox bypass. According to Straiker, Cursor's parser focused mainly on external executables while remaining blind to several shell builtins such as export, cd, and similar commands. This allowed the agent to modify the shell execution context without being blocked. The distinction is important: no unusual binary was required. Combining built-in shell primitives was enough to move beyond the intended constraints.

On macOS, the chain became even more impactful because the sandbox allowed writes to parts of the home directory. Straiker showed that the agent could modify ~/.zshenv, which is executed by every new Zsh instance. This introduced persistence, meaning malicious code could run again beyond the initial execution. At that point, the issue was no longer just an unintended command execution. It became a durable execution mechanism embedded in the system.

The final step was the remote tunnel hijack. The agent could be guided to start Cursor's built-in tunnel, generate the required GitHub authorization code, and send it to the attacker. This enabled persistent remote shell access through Microsoft Dev Tunnels. From a defensive perspective, detection was difficult because the traffic appeared as legitimate communication over common cloud infrastructure.

One of the most instructive aspects of this incident is that no additional explicit user interaction was required beyond opening the repository. There was no need to run a binary, approve an unusual prompt, or install external software. This makes the case particularly useful from a learning perspective, as it shows how the agent-read-execute chain can compress multiple risk steps into a flow that feels normal to the user.

2. The underlying structural problem

The Cursor bug is one example of a wider architectural pattern. Modern coding agents read untrusted input, interpret it with an LLM, and then act through tools that can modify files, run commands, access networks, and interact with integrations. The attack surface grows from the combination of those capabilities.

Prompt injection in agents resembles a confused deputy problem. The user gives the agent a legitimate goal, then the agent encounters content controlled by someone else. That content tries to redefine priorities, constraints, and goals. When the agent holds meaningful privileges, hostile text can influence real actions.

Anthropic's description of Claude Code Auto Mode is useful here because it acknowledges the shape of these failures. The documented risk categories include destructive or exfiltrating actions, degrading security posture through persistence or logging changes, crossing trust boundaries, scanning credential stores, and bypassing review on shared resources. These are operational security failures, not just bad model outputs.

A useful distinction emerges: tools differ in where they place the security boundary. Some emphasize infrastructure-level isolation through sandboxing, filesystem limits, disabled network access, and explicit approvals. Others add semantic controls that classify whether an action matches user intent. Both approaches are valuable, and they behave differently under ambiguous or malicious input.

Isolation is a technical boundary. Risk classification is a semantic boundary. Technical boundaries are easier to reason about because they constrain what can physically happen. Semantic boundaries are smoother for the user, because they allow more actions to proceed automatically, yet they depend on interpretation.

3. How to mitigate the risk

The first mitigation is reducing default privileges. An agent should operate in a sandbox with filesystem access limited to the workspace, with network access disabled or tightly restricted, and with permissions granted only for the task at hand. OpenAI documents Codex defaults as workspace-limited with network access disabled unless explicitly enabled. Anthropic's Claude Code sandboxing documentation similarly emphasizes filesystem and network boundaries enforced through operating-system primitives.

The second mitigation is keeping a human approval step for actions with meaningful side effects, especially for untrusted repositories, shared environments, credentials, deployments, migrations, configuration changes, and commands that extend beyond the local boundary. Claude Code, in its default mode, asks for approval for edits, shell commands, and network requests. Codex uses an on-request approval policy when crossing sandbox boundaries. These checkpoints interrupt many automated attack chains.

A third mitigation is treating all external content as untrusted, even when it looks like documentation. README files, issues, wikis, commit messages, tool outputs, web pages, and logs can all act as steering vectors. Anthropic addresses this by using a two-layer defense in Auto Mode: an input probe to detect injection patterns and an action classifier that evaluates decisions independently of raw tool output.

A fourth mitigation is separating environments and secrets. Autonomous tasks should run in containers or disposable environments, using temporary or scoped credentials, minimal privileges on Git and cloud systems, and monitoring outbound traffic. Both Anthropic and OpenAI emphasize defense in depth and controlled environments for higher-risk operations.

A fifth mitigation is ensuring observability of agent actions. Teams need audit trails, approval records, command logs, diff reviews, and alerts for unusual network or tunnel behavior. This is less glamorous than model capability, yet it is what lets a team understand what happened after an agent takes a risky action. Operational maturity is as important as model capability.

The stronger the autonomy, the more important the operational record becomes.

4. Default configurations: Cursor, Claude Code, Codex CLI

Looking at default configurations, Codex CLI appears the most oriented toward infrastructure containment. In the Auto preset, including --full-auto, Codex can read files, edit files, and run commands in the working directory automatically. It still asks for approval to edit outside the workspace or use the network. The --full-auto option is a convenience alias for workspace-write plus on-request approvals, while dangerous full access requires a separate bypass mode.

Claude Code, by default, emphasizes human approval. Default mode allows reads without prompting, while edits, shell commands, and network requests require approval. It also supports sandboxing for shell commands, with separate controls for permissions and execution environments. The result is a workflow centered on user oversight.

Cursor presents a more complex picture because it operates as a local editor, an agent platform, a cloud-agent system, and a remote-development environment. Cursor's agent can read files, edit files, run terminal commands, use web tools, and interact with other tools. Cursor's security documentation says sensitive actions require manual approval by default, terminal commands need approval by default, and arbitrary network requests are restricted in default settings. It also warns that allowlists are best-effort and that "Run Everything" skips safety checks.

Cursor's cloud-agent model has different tradeoffs. Cursor documents that cloud agents run in isolated VMs, have internet access by default, and auto-run terminal commands so they can iterate on tests. The same documentation explicitly calls out prompt-injection-driven data exfiltration risk and provides network access modes to restrict outbound domains. This reinforces the broader point: the relevant security posture depends heavily on the specific mode, environment, and configuration.

In a comparative view under typical usage, Codex CLI offers a clear technical boundary, Claude Code provides strong action-level user control, and Cursor delivers a highly composable platform with multiple execution modes. Each approach reflects a different balance between control, flexibility, and workflow speed.

5. More autonomous modes

In more autonomous configurations, the differences become clearer.

Claude Code Auto Mode is Anthropic's attempt to reduce approval fatigue while preserving a safety layer. A two-stage classifier that evaluates agent actions and replaces part of the human approval process. It blocks categories such as exfiltration, persistence, boundary crossing, and risky infrastructure changes. Anthropic reports measurable false positive and false negative rates, acknowledging that the system works best as a safer alternative to fully skipping permissions rather than as a complete replacement for careful review. Numbers are: a 17 percent false negative rate on real overeager actions and a 5.7 percent false negative rate on synthetic exfiltration tasks.

This approach is technically sophisticated, yet it introduces a probabilistic element into the security boundary. The system's effectiveness depends on how accurately the classifier interprets intent and context.

Independent evaluations already exist. The paper Measuring the Permission Gate tests Auto Mode in deliberately ambiguous authorization scenarios. It reports an 81.0 percent end-to-end false negative rate under that stress workload, with a major coverage gap around in-project file edits that did not pass through the classifier. The paper frames this as a different workload from Anthropic's production evaluation, which is an important distinction for readers.

For Codex CLI automation, the concept of "auto mode" is more constrained. The --full-auto setting still operates within workspace-write and on-request. Full privilege escalation requires explicit configuration. This means that even in automated workflows, Codex relies primarily on sandbox boundaries and approval policies rather than semantic classification.

For Cursor, increased autonomy comes through features such as auto-run, cloud agents, and remote environments. These enable powerful workflows, yet they also raise the importance of clearly defined execution boundaries. The NomShub case illustrates how risks can grow quickly when reading, decision-making, and execution are not sufficiently separated.

In summary, Claude Auto Mode represents an ambitious attempt at intelligent action governance. Codex CLI remains anchored in sandbox-based containment even during automation. Cursor provides a flexible and powerful agent platform, with risk shaped by how its capabilities are combined.

6. Finding the right balance for a development team

For a typical engineering team, the practical lesson is straightforward. Coding-agent risk depends less on model quality and more on what privileges the agent has, where it runs, how isolated it is, what it can read, what it can modify, who approves its actions, and how observable those actions are. The Cursor incident makes this visible, yet the principle applies across all similar tools.

A solid baseline policy prioritizes tight sandboxing, limited network access, scoped or temporary secrets, human approval for high-impact actions, and full autonomy only in controlled or disposable environments. Classifier-based autonomy, such as Claude Auto Mode, can reduce operational friction, though it should be treated as an additional safety layer rather than a replacement for technical boundaries.

Conclusion

The Cursor NomShub incident shows what happens when an agent that can read repositories, execute shell commands, and use remote features encounters malicious input and incomplete controls. The structural issue applies to the entire category of coding agents.

Effective mitigation combines least privilege, sandboxing, approvals, secret isolation, and observability. In default configurations, Codex CLI leans toward technical containment, Claude Code toward human-in-the-loop control, and Cursor toward composability and flexibility. In more autonomous setups, Claude explores probabilistic governance, Codex stays anchored to sandbox policies, and Cursor requires careful attention to how its capabilities are combined.

The deeper lesson is simple: once an AI agent can act, text becomes part of the control plane. Security design has to treat it that way.