LLMs and Vulnerabilities: What This Means for Engineers Building Systems Today

Anthropic released an article last month about how LLMs can be used to secure source code. As an engineer building systems, whether software or AI, you know that every application is prone to vulnerabilities: gaps in the system that attackers or users can exploit. "Vulnerabilities" is a word I have often heard developers use, so I took some time to study, first, what it actually means to have a vulnerability in your system.

The UK's National Cyber Security Centre (NCSC) defines it as follows:

"A vulnerability is a weakness in an IT system that can be exploited by an attacker to deliver a successful attack. They can occur through flaws, features or user error, and attackers will look to exploit any of them, often combining one or more, to achieve their end goal."

This means that each organisation that builds systems in production must ensure its applications are not vulnerable.

There is also an organisation called OWASP, the Open Worldwide Application Security Project, which sets clear standards for the kinds of vulnerabilities an application can be exposed to. It released a top-10 list of the most critical vulnerabilities in 2025; the link is under Further reading. There are also more detailed articles on other application vulnerabilities, what they mean, and how to mitigate them, which are worth a read if you want to go deeper.

But how does this affect an AI engineer, you might ask? The issue is that LLM-based and agentic systems are highly prone to these vulnerabilities and may be easier to hijack, because the LLM might not know the right way to go unless you explicitly guide it with instructions. Now what Anthropic is saying about using LLMs to secure code starts to make sense, right?

They highlighted a step-by-step process to go about this:

Threat model
Sandbox
Discovery
Verification
Triage
Patching

The good news is that Anthropic has already built skills for each of these steps, and they are all available in a GitHub repo. You essentially just need to point them at your own system, and you are good to go. I will explain each step below so you understand what is actually happening under the hood.

So what does each of these mean?

Threat model: Before you even start scanning for vulnerabilities, define what a vulnerability looks like in your system. Not every system is the same. A bug that is critical in one application might be completely irrelevant in another, depending on how it is built and who has access. The threat model is essentially a document that captures your system's context: what assets you are protecting, who can access what, and where your trust boundaries are. Anthropic's blog suggests having the LLM help you build this by feeding it your architecture docs, past bugs, and git history, and then having it interview someone who knows the system well. The output should be a living document in your codebase that gets updated as the system changes. One team they referenced went from a high false-positive rate to nearly 90% exploitability accuracy simply by giving the model a well-defined threat model. That tells you how much context matters. There is a threat-model.MD skill in the repo that walks you through exactly this.
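To make this concrete, here is a minimal sketch of what a threat model could look like as a structure that gets rendered into text for the model. The class, its field names, and the example system are all my own illustration; they are not taken from Anthropic's repo or skill.

```python
from dataclasses import dataclass, field


@dataclass
class ThreatModel:
    """Hypothetical structure; field names are illustrative only."""
    system: str
    assets: list[str] = field(default_factory=list)
    actors: dict[str, str] = field(default_factory=dict)  # actor -> access level
    trust_boundaries: list[str] = field(default_factory=list)
    out_of_scope: list[str] = field(default_factory=list)

    def to_prompt_context(self) -> str:
        """Render the model as plain text to place before a scanning prompt."""
        lines = [f"System: {self.system}", "Assets to protect:"]
        lines += [f"- {asset}" for asset in self.assets]
        lines.append("Actors and access:")
        lines += [f"- {name}: {access}" for name, access in self.actors.items()]
        lines.append("Trust boundaries:")
        lines += [f"- {boundary}" for boundary in self.trust_boundaries]
        lines.append("Not considered a vulnerability here:")
        lines += [f"- {item}" for item in self.out_of_scope]
        return "\n".join(lines)


# Example values are made up for illustration.
model = ThreatModel(
    system="internal billing API",
    assets=["customer payment records"],
    actors={"anonymous user": "login endpoint only", "admin": "full access"},
    trust_boundaries=["public internet -> API gateway"],
    out_of_scope=["bugs that require existing admin access"],
)
context = model.to_prompt_context()
print(context)
```

Keeping the threat model as data in the codebase, rather than as a one-off prompt, is what lets it be updated alongside the system.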

Sandbox: This is about creating a safe, isolated environment where the LLM agent can run and test without touching your real systems or production data. Think of it as a controlled test environment that mirrors production as closely as possible. The reason this matters is two-fold. First, you want the agent to operate safely; without a sandbox, it could accidentally access services it was never supposed to touch. Second, the sandbox allows you to prove that a vulnerability is actually exploitable rather than merely suspected. When an agent can compile code, run a proof of concept, and watch it detonate in a sandboxed environment, you move from "This might be a problem" to "This is definitely a problem." That distinction saves a lot of wasted engineering time later. The repo includes a reference sandbox setup with everything you need to get this running.
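The "prove it" half of that idea can be sketched in a few lines. The snippet below runs a proof-of-concept script in a scratch directory, with a timeout and a stripped-down environment. To be clear, this is not real isolation and it is not the repo's reference sandbox: a proper setup would use a container or VM with no network access and no production credentials. The function name, the allowlist, and the convention that exit code 0 means "the exploit worked" are my assumptions.

```python
import os
import subprocess
import sys
import tempfile

# Environment variables the child process may inherit. Everything else
# (API keys, cloud credentials, tokens) is dropped. Illustrative allowlist.
ALLOWED_ENV = {"PATH", "LD_LIBRARY_PATH", "SYSTEMROOT"}


def run_poc(poc_source: str, timeout_s: float = 10.0) -> dict:
    """Run a proof-of-concept script and report whether it succeeded.

    Assumed convention: the PoC exits with code 0 only if the exploit worked.
    """
    env = {k: v for k, v in os.environ.items() if k in ALLOWED_ENV}
    with tempfile.TemporaryDirectory() as scratch:
        try:
            result = subprocess.run(
                [sys.executable, "-c", poc_source],
                cwd=scratch,          # PoC starts in an empty scratch directory
                env=env,
                capture_output=True,
                text=True,
                timeout=timeout_s,    # kill runaway PoCs
            )
        except subprocess.TimeoutExpired:
            return {"status": "timeout", "output": ""}
    status = "exploited" if result.returncode == 0 else "unproven"
    return {"status": status, "output": result.stdout.strip()}


print(run_poc("print('poc ran')"))
```

Note that a failed run is reported as "unproven" rather than "safe": a PoC that does not fire only tells you this attempt did not work.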

Discovery: This is the actual scanning phase, where the model analyses your source code for vulnerabilities. What is interesting here is that Anthropic found that simpler, less prescriptive prompts work better. If you give the model a long checklist of things to look for, it narrows its own thinking and misses things. Better to tell it the goal, give it the threat model as context, and let it explore. They also suggest splitting the codebase into sections and running multiple agents in parallel on different parts, so you don't end up with five agents all finding the same surface-level bugs. Discovery optimises for catching as much as possible, so at this stage, you want recall over precision. You will clean it up in the next steps. There is a vuln-scan skill in the repo that handles this partitioning and parallel scanning automatically.
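Here is a rough sketch of the partition-and-parallelise idea. It groups files by top-level directory and hands each group to its own worker. The grouping rule is my own simplification, and scan_partition is a placeholder where a real LLM call would go; none of this is the vuln-scan skill's actual implementation.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePosixPath


def partition_by_top_dir(paths: list[str]) -> dict[str, list[str]]:
    """Group file paths by top-level directory so agents do not overlap."""
    groups = defaultdict(list)
    for path in paths:
        parts = PurePosixPath(path).parts
        key = parts[0] if len(parts) > 1 else "."  # "." holds root-level files
        groups[key].append(path)
    return dict(groups)


def scan_partition(name: str, files: list[str], threat_context: str) -> dict:
    """Placeholder for one agent run; a real version would call an LLM here."""
    # Deliberately short, goal-oriented prompt rather than a long checklist.
    prompt = (
        f"{threat_context}\n\n"
        f"Find exploitable vulnerabilities in: {', '.join(files)}"
    )
    return {"partition": name, "prompt": prompt, "findings": []}


def discover(paths: list[str], threat_context: str, max_workers: int = 4) -> list[dict]:
    """Run one scan per partition in parallel and collect the reports."""
    partitions = partition_by_top_dir(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(scan_partition, name, files, threat_context)
            for name, files in partitions.items()
        ]
        return [future.result() for future in futures]


reports = discover(
    ["api/auth.py", "api/routes.py", "db/query.py", "setup.py"],
    threat_context="System: internal billing API",
)
print([report["partition"] for report in reports])
```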

Verification: Here is where you filter out what is not actually exploitable. A separate agent, completely independent of the one that made the discovery, is brought in specifically to try to disprove each finding. It does not get to see the first agent's reasoning. The whole point is to have something that challenges the findings rather than confirms them. When teams let the same agent verify its own work, it simply agreed with itself. Keeping them separate roughly halved the number of false positives. If the sandbox is in place, the verifier can also attempt to build and run a proof of concept, and if that succeeds, you have a confirmed, exploitable vulnerability. If it fails, the finding is still kept but flagged as unproven rather than dismissed entirely. This is handled as part of the triage skill in the repo, which runs multi-vote verification before moving on.
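A small sketch of the two ideas in this step: hiding the first agent's reasoning from the verifier, and combining several verifier votes with the proof-of-concept result. The Finding fields, the majority rule, and the three status labels are my assumptions about how this could be wired up, not the repo's actual logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    location: str
    claim: str
    reasoning: str  # the discovery agent's reasoning, withheld from verifiers


def verifier_input(finding: Finding) -> dict:
    """Give the verifier only the claim and location, never the reasoning."""
    return {"location": finding.location, "claim": finding.claim}


def classify(votes: list[bool], poc_succeeded: Optional[bool]) -> str:
    """Combine verifier votes with the proof-of-concept result.

    votes: True means that verifier tried and failed to disprove the finding.
    poc_succeeded: None when no sandbox run was attempted.
    """
    if poc_succeeded:
        return "confirmed"   # a working PoC settles it
    if sum(votes) * 2 > len(votes):
        return "unproven"    # survived most challenges, but no working PoC
    return "rejected"        # most verifiers managed to disprove it


finding = Finding(
    location="api/auth.py:42",
    claim="session token is compared with ==",
    reasoning="(discovery agent's notes, not shown to the verifier)",
)
print(verifier_input(finding))
print(classify([True, True, False], poc_succeeded=False))
```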

Triage: At this point, you may have a long list of verified findings, but not all are equal. Triage is about organising that list into something a team can actually act on. The first thing to do is deduplicate by root cause; the same underlying bug might show up at 20 different call sites in the codebase, but that is one fix, not 20. Once you have done that, you rank by real-world severity. Key questions: Can an unauthenticated attacker reach this? Does untrusted input actually flow to the vulnerable point? How much damage could be done if it is exploited? The goal is to hand engineers a short, prioritised list where the most dangerous items are at the top, not a dump of hundreds of findings that creates alert fatigue and gets ignored. The repo's triage skill handles deduplication and re-ranking for you.
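Deduplicating and ranking can be sketched like this. The dictionary keys and the scoring weights are invented for illustration; a real triage step would use whatever severity scheme your team already trusts.

```python
from collections import defaultdict


def severity(finding: dict) -> int:
    """Toy score built from the three key questions; weights are arbitrary."""
    score = finding["impact"]          # 1 (minor) to 5 (severe)
    if finding["unauthenticated"]:     # reachable without logging in?
        score += 3
    if finding["untrusted_input"]:     # does untrusted input reach the sink?
        score += 2
    return score


def triage(findings: list[dict]) -> list[dict]:
    """Collapse findings that share a root cause, then rank worst-first."""
    by_cause = defaultdict(list)
    for finding in findings:
        by_cause[finding["root_cause"]].append(finding)

    ranked = [
        {
            "root_cause": cause,
            "call_sites": sorted(f["call_site"] for f in group),
            "score": max(severity(f) for f in group),  # worst case in the group
        }
        for cause, group in by_cause.items()
    ]
    ranked.sort(key=lambda item: item["score"], reverse=True)
    return ranked


# Example findings are made up for illustration.
findings = [
    {"root_cause": "unsanitised filename", "call_site": "upload.py:10",
     "unauthenticated": False, "untrusted_input": True, "impact": 2},
    {"root_cause": "string-built SQL", "call_site": "orders.py:88",
     "unauthenticated": True, "untrusted_input": True, "impact": 5},
    {"root_cause": "string-built SQL", "call_site": "users.py:31",
     "unauthenticated": False, "untrusted_input": True, "impact": 5},
]
for item in triage(findings):
    print(item["score"], item["root_cause"], item["call_sites"])
```

Three raw findings become two work items, with the SQL issue listed once even though it appears at two call sites.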

Patching: The final step is writing the actual fix. The approach that works best is test-driven: write a test that fails because of the vulnerability, then write the patch that makes it pass, then confirm nothing else broke. The model is also prompted to look for variants of the same bug elsewhere in the codebase, because a codebase that has one SQL injection vulnerability often has more of them in similar patterns. Before shipping, a fresh agent performs an adversarial check on the patch itself, looking for ways to bypass it. Generated patches can sometimes be too restrictive and break dependencies, so human review remains important. The goal is to make that review as lightweight as possible by the time it reaches an engineer. There is a patch skill in the repo that generates candidate fixes and runs an independent reviewer agent on each one.
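Since SQL injection is the example above, here is what the test-first loop could look like on a toy case using Python's built-in sqlite3 module. The table, the function names, and the payload are made up for illustration; the pattern is what matters: a regression check that detects the bug, a patch that clears it, and a check that normal behaviour still works.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database for the test."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def find_user_vulnerable(conn: sqlite3.Connection, name: str) -> list:
    # Bug: user input is interpolated straight into the SQL string.
    query = f"SELECT name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_patched(conn: sqlite3.Connection, name: str) -> list:
    # Fix: a bound parameter is treated as data, never as SQL.
    return conn.execute("SELECT name FROM users WHERE name = ?", (name,)).fetchall()


def injection_returns_rows(find_user) -> bool:
    """Regression check: True means the injection payload still leaks rows."""
    payload = "nobody' OR '1'='1"
    return len(find_user(make_db(), payload)) > 0


# Step 1: the check detects the bug in the original code.
assert injection_returns_rows(find_user_vulnerable)
# Step 2: the same check comes back clean against the patched code.
assert not injection_returns_rows(find_user_patched)
# Step 3: normal behaviour still works after the patch.
assert find_user_patched(make_db(), "alice") == [("alice",)]
```

The same check can then be pointed at similar query-building code elsewhere to hunt for variants of the bug.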

The bigger picture

What Anthropic's work points to is a shift in where the constraint actually sits. Finding vulnerabilities at scale is no longer the hard part. LLMs can now do that quickly and across large codebases. The bottleneck is everything that comes after: confirming what is real, prioritising what matters, and actually shipping the fixes. As of late May 2026, Anthropic had disclosed over 1,500 vulnerabilities in open-source software through its scanning work. Only 97 had been patched.

That ratio is the story.

For anyone building systems today, especially agentic systems, this is worth sitting with. The tools to identify your vulnerabilities before someone else does are becoming increasingly accessible. But they are only useful if the pipeline after discovery is set up to handle what they surface.

Further reading:

Anthropic — "Using LLMs to Secure Source Code": https://claude.com/blog/using-llms-to-secure-source-code
OWASP Top 10 2025: https://owasp.org/Top10/2025/
Top 10 Common Web Application Vulnerabilities: https://medium.com/@ajay.monga73/top-10-common-web-application-vulnerabilities-and-best-practices-for-prevention-430fc675f273
