How I Structure My WordPress Vulnerability Hunting Workflow

Over the past two months I've been building out and testing a vulnerability hunting framework for the WordPress ecosystem. I want to be upfront about what it is and isn't: it's not a magic box that replaces manual auditing. It's a system that helps me spend my time better — scan more targets, kill noise early, hold onto review state across runs, and consistently surface the most promising leads for me to look at myself.

At a high level it combines custom static analysis, staged AI review, local verification, and reporting. The same general pattern applies whether I'm looking at plugins or themes, even though each has its own quirks. I think of it less as an "autonomous researcher" and more as something that helps me prioritize and build evidence faster.

Why I Built It This Way

The WordPress plugin/theme ecosystem is just too big to review by hand at any real scale. Even if I narrow down to specific vulnerability classes, there's still a huge amount of code, and a lot of findings that look suspicious at first glance but collapse on closer inspection.

Plain SAST isn't enough on its own — it's great at flagging suspicious patterns, but also great at burying you in noise. And throwing an LLM at "review everything" with no structure is both expensive and unreliable. So I built this in stages on purpose.

The basic idea: do the cheap, simple rule-based checks first, save the expensive analysis for whatever survives, and keep myself in the loop at the end.

The High-Level Flow

1. Collection and static scanning

It starts with collecting targets and scanning code at scale. For public targets that means continuously pulling down WordPress plugins and themes and running them through a WordPress-focused static analysis pass.

The core scanning layer is Semgrep with a set of custom rules tuned for WordPress-specific patterns — things that actually matter on this attack surface, like request data flowing into dangerous sinks, missing authorization on AJAX/REST handlers, unsafe deserialization, file inclusion, XSS sinks, and so on.

This first pass is meant to be broad. Its job isn't to say "this is a vulnerability" — it's to say "this is worth a second look."
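
As a rough illustration of how scan output can be turned into records for the later stages, here is a minimal Python sketch. It assumes Semgrep is run with its JSON output option and reads the `check_id`, `path`, `start.line`, and `extra.lines` fields from each result. The `Finding` class and both function names are made up for this example and are not the framework's actual code.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One scanner hit. This is a lead to look at, not a confirmed bug."""
    rule_id: str
    path: str
    line: int
    snippet: str


def semgrep_command(rules_dir: str, target_dir: str) -> list:
    # Builds the CLI call only; running it is left to the caller.
    return ["semgrep", "scan", "--config", rules_dir, "--json", target_dir]


def parse_semgrep_json(raw: str) -> list:
    """Convert Semgrep's JSON report into a list of Finding records."""
    data = json.loads(raw)
    findings = []
    for result in data.get("results", []):
        findings.append(
            Finding(
                rule_id=result["check_id"],
                path=result["path"],
                line=result["start"]["line"],
                snippet=result.get("extra", {}).get("lines", ""),
            )
        )
    return findings
```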

2. Stage 0: Simple rule-based triage

After the initial scan, the framework applies a rule-based filtering layer before any deeper review happens. This stage exists for one reason: static analysis is noisy, and there is no value in paying an AI model or a human to re-evaluate obvious junk.

This stage handles the boring-but-important cleanup:

groups findings into a smaller set of review families instead of treating each hit separately (a single file can contain many findings in the same bug class)
throws out obvious noise such as generated or irrelevant paths
filters out findings that are clearly non-exploitable based on simple regex checks
keeps low-confidence "maybe worth a manual look someday" leads in their own lane instead of mixing them into the main pipeline

This stage also keeps the rest of the pipeline honest. If a finding can't survive a basic sanity check, it has no business taking up review time later.
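
The cleanup steps in this stage can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the noise-path pattern, the "clearly safe" regex (an integer cast wrapped directly around request data), and the dictionary keys on each finding are stand-ins, not the real rule set.

```python
import re
from collections import defaultdict

# Hypothetical noise filter: bundled dependencies, tests, minified assets.
NOISE_PATH = re.compile(r"(^|/)(vendor|node_modules|tests?)/|\.min\.js$")
# Hypothetical "clearly non-exploitable" check: request data cast to an integer.
CLEARLY_SAFE = re.compile(r"\b(intval|absint)\s*\(\s*\$_(GET|POST|REQUEST)\b")


def stage0(findings):
    """Split findings into review families and a separate low-confidence lane.

    Each finding is a dict with 'path', 'bug_class', 'snippet', 'confidence'.
    """
    families = defaultdict(list)
    low_confidence = []
    for f in findings:
        if NOISE_PATH.search(f["path"]):
            continue  # generated or irrelevant path
        if CLEARLY_SAFE.search(f["snippet"]):
            continue  # fails the basic sanity check
        if f.get("confidence") == "low":
            low_confidence.append(f)  # kept, but out of the main pipeline
            continue
        # One family per file and bug class instead of one item per hit.
        families[(f["path"], f["bug_class"])].append(f)
    return dict(families), low_confidence
```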

3. Deduplication and reusing past review state

WordPress scanning produces a lot of overlap. Different rules flag the same sink, the same plugin gets rescanned later with mostly unchanged code, and findings I already reviewed and dismissed keep coming back. Re-reviewing all of that from scratch every run would burn tokens and my attention for nothing.

So before anything moves forward, the framework collapses overlapping findings, checks what's already been reviewed, and only sends through what's actually new or still genuinely uncertain.
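
One way to implement this kind of deduplication is a stable fingerprint per finding plus a lookup of past verdicts. The sketch below is an assumption about how that could work, not the framework's actual scheme. It deliberately leaves the rule ID and line number out of the fingerprint so that overlapping rules and shifted line numbers collapse to the same entry, and the verdict labels are invented for the example.

```python
import hashlib
import re


def fingerprint(plugin: str, path: str, bug_class: str, snippet: str) -> str:
    """Stable ID for a finding that survives rescans of unchanged code."""
    # Collapse whitespace so formatting-only changes keep the same fingerprint.
    normalized = re.sub(r"\s+", " ", snippet).strip()
    key = "\x1f".join([plugin, path, bug_class, normalized])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def select_for_review(findings, reviewed):
    """Return only findings that are new or still uncertain.

    'reviewed' maps fingerprint -> past verdict
    ('false_positive', 'confirmed', or 'unclear').
    """
    seen = set()
    queue = []
    for f in findings:
        fp = fingerprint(f["plugin"], f["path"], f["bug_class"], f["snippet"])
        if fp in seen:
            continue  # same sink flagged more than once in this run
        seen.add(fp)
        if reviewed.get(fp) in ("false_positive", "confirmed"):
            continue  # already settled in an earlier run
        queue.append(f)
    return queue
```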

4. Stage 1: grouped AI triage

Once the obvious noise is gone, what's left goes through a lighter AI pass.

The goal here isn't certainty — it's sorting. Each finding ends up in roughly one of three buckets: worth a deeper look, probably a false positive, or still unclear.

One thing I do differently here: I don't review findings one at a time in isolation. Findings from the same file or the same area of code get grouped and reviewed together, because the context that actually tells you whether something's exploitable usually lives in the surrounding code, not just the flagged line.

That lets the model ask more useful questions, like:

Is the suspicious sink actually reachable?
Is there an auth check nearby that the scanner couldn't follow?
Is the data flow still dangerous once the surrounding code is considered?

By the end of this stage, the goal is not perfect truth. The goal is a much smaller queue of findings that are worthy of deeper analysis.
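
To make the grouping idea concrete, here is a small Python sketch that builds one review prompt per file instead of one per finding. The prompt layout, the `Verdict` names, and the `read_source` callback are all assumptions for illustration. The actual model call is left out on purpose.

```python
from collections import defaultdict
from enum import Enum


class Verdict(Enum):
    DEEPER_LOOK = "deeper_look"
    LIKELY_FALSE_POSITIVE = "likely_false_positive"
    UNCLEAR = "unclear"


QUESTIONS = (
    "Is the suspicious sink actually reachable?",
    "Is there an auth check nearby that the scanner couldn't follow?",
    "Is the data flow still dangerous once the surrounding code is considered?",
)


def build_group_prompts(findings, read_source):
    """Return {path: prompt}. read_source(path) -> str is supplied by the caller."""
    by_file = defaultdict(list)
    for f in findings:
        by_file[f["path"]].append(f)

    prompts = {}
    for path, group in by_file.items():
        ordered = sorted(group, key=lambda item: item["line"])
        parts = [
            f"File: {path}",
            "Flagged locations:",
            *[f"- line {item['line']}: {item['bug_class']}" for item in ordered],
            "Questions:",
            *[f"- {q}" for q in QUESTIONS],
            "Answer with one of: " + ", ".join(v.value for v in Verdict),
            "Source:",
            read_source(path),  # the whole file, so nearby checks are visible
        ]
        prompts[path] = "\n".join(parts)
    return prompts
```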

5. Stage 2: contextual deep review

This is where things get more selective, and more evidence-driven than pattern-matching.

For each finding that made it this far, the system pulls in more surrounding sources and tries to answer the questions I would actually care about:

What's the likely vulnerability class?
What access level does it need?
Does the code path look practically exploitable?
And, importantly, what's the actual evidence behind that conclusion?

This is also where I try to force discipline into the workflow. A strong-looking claim still needs grounded evidence. If the evidence is weak, incomplete, or contradictory, the right answer is not to auto-promote it just because it looks exciting.

That matters because a sink without a real source is not a bug. A source without a meaningful sink is not a bug. A suspicious pattern that falls apart once state, sanitization, or access boundaries are understood is not a bug.
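
That rule can be written down as a very small promotion gate. This is a sketch under assumptions: the `Evidence` fields and the `should_promote` name are invented here, and a real review would carry much richer evidence than three fields.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Evidence:
    source: Optional[str] = None  # attacker-controlled input, if one was found
    sink: Optional[str] = None    # dangerous operation, if one was found
    # Anything that breaks the path: sanitization, auth checks, state constraints.
    blockers: List[str] = field(default_factory=list)


def should_promote(e: Evidence) -> bool:
    """Promote only when source and sink are both grounded and nothing blocks the path."""
    return bool(e.source) and bool(e.sink) and not e.blockers
```

A finding with only a sink, only a source, or any recorded blocker stays where it is instead of being auto-promoted.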

Here are a few examples.

In this first case, after running Stage 1 (quick AI triage), the AI is confident that the finding is a false positive and decides not to pass it to Stage 2.

In this case, the finding did get promoted to stage 2, and the AI model in stage 2 is very confident that this finding is a real vulnerability.

Even so, after dynamic verification in the local environment with Claude Code, the final result showed that this finding was a false positive.

So whatever comes out of Stage 2 with solid evidence, meaning any finding I'd consider to have a high chance of being a real vulnerability, is what I feed into my Claude-based verification flow next.

That flow has its own structure and skills it has to follow depending on the bug class, which is what the next section is about.

6. Promotion, local verification, and reporting

High-confidence findings move into a case workflow from here.

I weigh the vulnerability class, the apparent impact, how solid the evidence looks, and honestly just whether it feels real enough to be worth my time.

When I do want to verify a finding, I usually use Claude Code as a verification assistant. Depending on the vulnerability family, I can steer it with more specialized guides or skills for areas like XSS, SQL injection, deserialization, file inclusion, auth-boundary checks, and reporting.

Around that, I have supporting scripts and helper flows that handle repetitive work such as preparing or resetting the local WordPress environment, installing or removing the target plugin, collecting the right authentication context, and doing structured auth-probing or other setup steps.
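
For the install and remove part of that setup, a helper can be as simple as a thin wrapper around WP-CLI. The sketch below assumes WP-CLI is installed and that the local site lives at a path you pass in. The function names are hypothetical, and the commands are built as plain lists so they can be inspected or logged before anything runs.

```python
import subprocess


def install_commands(wp_path, slug, version=None):
    """WP-CLI call that installs and activates the target plugin."""
    cmd = ["wp", "plugin", "install", slug, "--activate", f"--path={wp_path}"]
    if version:
        cmd.append(f"--version={version}")  # pin the version under test
    return [cmd]


def remove_commands(wp_path, slug):
    """WP-CLI calls that deactivate and then delete the target plugin."""
    return [
        ["wp", "plugin", "deactivate", slug, f"--path={wp_path}"],
        ["wp", "plugin", "delete", slug, f"--path={wp_path}"],
    ]


def run_all(commands, runner=subprocess.run):
    # check=True stops the sequence at the first failing command.
    for cmd in commands:
        runner(cmd, check=True)
```

Passing a different `runner` makes it easy to dry-run the sequence without touching the environment.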

That structure is also what makes the verification step actually efficient. Because Claude isn't starting from zero — it gets the right guide for the bug class and an environment that's already set up with the right context — it can stay focused on the actual question instead of burning tokens on setup, exploration, or figuring out what it's even looking at.

The reporting side follows the same philosophy — every finding carries its history with it: where it came from, why it survived earlier stages, what evidence backs it up, and what's still left for a human to decide.
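
A record along these lines is one way to make a finding carry its history. It is only a sketch: the `Case` and `StageNote` classes, their field names, and the plain-text layout from `render` are assumptions for illustration rather than the real report format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StageNote:
    stage: str
    verdict: str
    reason: str


@dataclass
class Case:
    plugin: str
    bug_class: str
    origin: str  # where the finding came from, e.g. the rule that flagged it
    history: List[StageNote] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)
    open_questions: List[str] = field(default_factory=list)

    def record(self, stage: str, verdict: str, reason: str) -> None:
        """Append why the finding survived (or not) at a given stage."""
        self.history.append(StageNote(stage, verdict, reason))

    def render(self) -> str:
        lines = [f"{self.plugin}: {self.bug_class}", f"Origin: {self.origin}", "History:"]
        lines += [f"  {n.stage}: {n.verdict} ({n.reason})" for n in self.history]
        lines.append("Evidence:")
        lines += [f"  {e}" for e in self.evidence]
        lines.append("Left for a human:")
        lines += [f"  {q}" for q in self.open_questions]
        return "\n".join(lines)
```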

One thing I've been thinking about for later is adding a separate Claude-based orchestrator that watches new findings, picks out the ones that look both interesting and plausibly exploitable, and kicks off verification while I'm asleep or away from the keyboard. I'm a bit cautious about this since it leans even more on AI judgment. Still, it could meaningfully cut down the lag between "a strong finding shows up" and "I actually look at it."

End-to-End View

Each section above is how I think about one piece of the pipeline. Put together, the whole thing looks like this:

What This Has Been Good At

I'm genuinely happy with what's come out of this so far.

It's surfaced a real mix of vulnerability classes, including:

A critical unauthenticated PHP object injection
Some impactful SQL injection findings
XSS (a lot :v)
Simple broken access control
Local file inclusion / path-related bugs

The pipeline is good at turning a big pile of scanned code into actual leads worth chasing — especially for bugs that aren't trivially obvious but still leave enough static "fingerprint" for the staged review to build confidence on.

It's also been useful operationally just because it remembers things. Since it carries state forward instead of acting like a one-shot scanner, I can come back to the same plugin or ecosystem weeks later without redoing the same work.

Proof of work

The framework started running at the end of April. After submitting 9 reports I placed in the top 19 (close enough to get the bounty :v), including 1 zero-day with a CVSS score of 9.8.

In May the results were a bit better: I made it into the top 9.

So far, I have submitted 31 reports to Patchstack.

Where It Falls Short

I want to be honest about the limitations, too.

This is fundamentally a static-analysis-first system, and that comes with blind spots. It struggles with:

business logic flaws
ownership/authorization bugs that depend on application state
second-order vulnerabilities
multi-step exploit chains
cases where data passes through wrappers, storage layers, or framework abstractions before hitting the actual sink

Basically, if a bug depends on how the application behaves rather than how one function looks on its own, this framework gets a lot less reliable.

It has missed real, high-impact bugs, including deserialization and SQLi issues that turned out to be very impactful. Looking back at those misses, the problem usually wasn't that the bug was undetectable in principle. It's that the source and sink were too far apart, the dangerous behavior only emerged once you crossed a few layers, or the exploitability depended on context that's basically invisible from the code alone.

Broken access control is another area where I try to keep my expectations realistic. The simple "missing capability check" cases are usually easy to spot. The interesting BAC bugs almost never are — they're about ownership, object scoping, or workflow assumptions, and the framework often just doesn't have enough signal to catch those.

Why Manual Verification Still Matters

If there's one thing this project keeps reminding me, it's that finding something and proving it are two very different things.

Plenty of findings that look great on paper don't survive actual testing. Sometimes WordPress core changes the data shape just enough to break the exploit. Sometimes the sink is real but the attacker doesn't actually control the input the way it first appears. Sometimes the direction is right but the impact is overstated.

So manual verification isn't optional for me — it's the gate before I trust a finding at all. The framework gets me to the right place faster, but it doesn't do the thinking for me.

The Part I Like Most

If I had to pick what I value most about this whole setup, it's not "it finds bugs automatically." It's that it gives me a structured way to hunt at scale without pretending the hard parts go away.

The scanner surfaces findings, the staged review cuts the noise, the deeper pass adds context, and the verification/reporting layer makes the last mile repeatable. And honestly, the misses are almost as valuable as the hits — they're what tell me where the methodology still needs work.

That feedback loop is probably the most valuable thing about the whole project. Every false positive, every missed bug, and every confirmed finding feeds back into the rules, the review flow, and how I decide what's worth my attention next.

On top of that, the Claude verification step has saved me a lot of time on the reproduction side. Often I already know exactly which function or line the vulnerable code lives in. What I don't know is how to actually reach it: which page, action, plugin setting, or request triggers that code path in a running WordPress site. Having Claude work through the plugin/UI to figure out the right trigger path takes most of that trial and error off my plate.

Closing Thoughts

I don't see this as a replacement for doing the research myself — it's leverage.

When it works, it lets me cover more ground, spend less time on obvious dead ends, and put my own attention where it's most likely to pay off. When it doesn't work, it's a pretty good reminder of why this job still comes down to judgment, skepticism, and actually verifying things.

And looking at the bounty platforms right now — Patchstack, Wordfence, and similar — submitting findings has started to feel like a race. A lot of researchers are submitting similar vulnerabilities at high volume and high speed, which tells me they're likely running frameworks similar to mine, or even more advanced.

What's made this clearer to me is going back through disclosed reports from top researchers on the Patchstack leaderboard. Many of the bugs they found are genuinely complex — the kind that would be very hard, maybe nearly impossible, to catch with static analysis alone. Looking at those reports, I can see clearly what my framework would have missed.

So while the framework is useful for covering ground quickly, I think there's still a lot of room — both to improve the tooling, and for old-fashioned manual research — to find the kind of deep, high-impact bugs that static-analysis-driven approaches just aren't built to catch.
