You’re Probably Overpaying for the Tokens That Find Your Vulnerabilities

The most expensive, most sophisticated security model on our benchmark finished tenth. A model that costs a dollar to run beat it. Here's what you're actually getting per dollar — and it isn't what the price tags suggest.

Here is an uncomfortable fact for anyone spending money to find vulnerabilities in their code with AI.

We took two dozen models, ran each through the same agentic harness against the same 26 real codebases containing 697 hand-labeled vulnerabilities, and scored every one with the same open pipeline. Same scaffolding, same prompts, same targets — only the model in the slot changes. Then we lined up the results next to what each model costs to run.

The relationship between price and performance was not what the marketing would lead you to expect. It was barely a relationship at all.

The single most expensive, most heavily promoted security model in the lineup — the one its maker spent the spring calling almost too dangerous to release — finished tenth out of twenty-four. A model that costs roughly one dollar to run the entire benchmark finished fifth, ahead of it, ahead of models costing thirty and seventy times more. The gap between the best model on the board and one costing a sixtieth as much was under four points.

If you are paying frontier prices on the assumption that frontier prices buy frontier security, the data says you should check that assumption. Loudly.

What we measured, and why you can trust the number

RealVuln is an open benchmark for security scanners. It covers twenty-six real Python repositories, 697 vulnerabilities labeled by hand, and 120 deliberate false-positive traps: code engineered to look vulnerable when it isn't, planted specifically to catch tools that bluff their way to a good score. Every model runs through the same fixed harness against the identical gauntlet, so the only thing changing between rows is the model itself. That is what makes the cost comparison fair: you are seeing what each model contributes in the same slot, not the results of twenty-four differently tuned products.
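To make the role of the traps concrete, here is a minimal sketch of how precision and recall move when a tool flags trap code. The finding identifiers and the exact-match rule are hypothetical, made up for illustration; the benchmark's real matching logic lives in its open scoring engine and may differ.

```python
def precision_recall(findings: set[str], real_vulns: set[str]) -> tuple[float, float]:
    """Precision and recall of a set of findings against hand-labeled vulnerabilities."""
    true_positives = len(findings & real_vulns)
    precision = true_positives / len(findings) if findings else 0.0
    recall = true_positives / len(real_vulns) if real_vulns else 0.0
    return precision, recall

# Hypothetical labels: four real vulnerabilities and two planted traps.
real_vulns = {"sqli-login", "ssrf-fetch", "xss-comment", "path-upload"}
traps = {"trap-escaped-query", "trap-allowlisted-url"}

careful = {"sqli-login", "ssrf-fetch", "xss-comment"}  # misses one, flags no traps
bluffer = real_vulns | traps                           # flags everything suspicious

careful_pr = precision_recall(careful, real_vulns)
bluffer_pr = precision_recall(bluffer, real_vulns)
print("careful:", careful_pr)
print("bluffer:", bluffer_pr)
```

In this toy case, every trap the bluffer flags counts as a false positive and drags its precision down, which is what the planted traps are there to expose.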

One thing separates this from a vendor's accuracy claim: you can check it. The labels, every model's raw output, the scoring engine, and the prompts are all open under an MIT license. Re-run any row and you get the same number — including the rows where our own scanner wins (RealVuln is built by Kolega, and Kolega's scanner is on the board) and the rows where it loses. A benchmark you can't lose on is just a billboard with extra steps.

A word on how we score, because it is the heart of the argument. We rank by F3, not the more familiar F1. Both balance precision (how many of a tool's findings are real) against recall (how many of the real bugs it caught), but F3 is the F-beta score with beta set to 3, which weights recall nine times as heavily as precision (the weight is beta squared). That is a deliberate, defensible choice for security: a false alarm costs an engineer a few minutes; a missed vulnerability ships to production and waits for someone to exploit it. We reward the tool that misses the least. The whole industry has spent years optimizing for the opposite (quiet, tidy output that demos well), and that is part of how we ended up here.
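For readers who want the arithmetic, here is a minimal sketch of the textbook F-beta formula, assuming the benchmark uses the standard definition. The precision and recall values are invented for illustration and are not taken from the leaderboard.

```python
def f_beta(precision: float, recall: float, beta: float = 3.0) -> float:
    """Standard F-beta score: a weighted harmonic mean of precision and recall."""
    if precision <= 0.0 or recall <= 0.0:
        return 0.0
    b2 = beta * beta  # recall's weight relative to precision
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Two hypothetical tools with mirrored strengths (illustrative numbers only).
noisy_but_thorough = {"precision": 0.30, "recall": 0.70}
quiet_but_leaky = {"precision": 0.70, "recall": 0.30}

for name, tool in [("noisy_but_thorough", noisy_but_thorough),
                   ("quiet_but_leaky", quiet_but_leaky)]:
    f1 = f_beta(tool["precision"], tool["recall"], beta=1.0)
    f3 = f_beta(tool["precision"], tool["recall"], beta=3.0)
    print(f"{name}: F1={f1:.3f} F3={f3:.3f}")
```

Under F1 the two hypothetical tools tie. Under F3 the one that misses fewer bugs scores roughly twice as high, which is the behavior the ranking is designed to reward.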

The part that should change how you buy

Read the top of the board with the price column open next to it.

GPT-5.5 wins. F3 of 60.2, the best on the board, at around $66 to run and a published rate of $5 per million input tokens. The money is not wasted: it genuinely tops the chart. If you have a security-critical codebase where missing a bug is catastrophic and budget is no object, the best is the best and you should buy it.

But look further down the price column. DeepSeek V4 Flash scores 56.5, fifth overall, for about a dollar a run, at fourteen cents per million input tokens. That is within four points of the winner at roughly a sixtieth of the cost. GLM-5.1 reaches 57.1, second on the entire board, for about ten dollars. Kimi K2.6 beats most of the expensive frontier models for about six dollars.

Now look at what the premium tier actually delivered. Claude Opus 4.8, at $36 a run, scored 53.6 — below a one-dollar model. Claude Fable 5, the headline act, at an estimated $71, scored 50.5 and finished tenth — behind three older, cheaper Claudes from its own maker. The most you could spend in that family did not buy the most you could catch. It bought less.

Plot the whole board, price on one axis and security performance on the other, and the line you'd expect — pay more, catch more — does not appear. In places it bends the wrong way. That is the finding. Not that expensive models are bad, but that price has come loose from performance, and anyone using cost as a proxy for security capability is navigating by a broken instrument.
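The comparison can be reproduced with a few lines of arithmetic. This sketch uses only the scores and approximate run costs quoted in this article (Kimi K2.6 is left out because its score is not quoted here); the function name and output fields are illustrative, and the costs are rounded estimates rather than exact figures.

```python
# (model, F3 score, approximate run cost in USD) as quoted in this article.
BOARD = [
    ("GPT-5.5", 60.2, 66.0),
    ("GLM-5.1", 57.1, 10.0),
    ("DeepSeek V4 Flash", 56.5, 1.0),
    ("Claude Opus 4.8", 53.6, 36.0),
    ("Claude Fable 5", 50.5, 71.0),  # cost is an estimate
]

def compare_to_leader(board):
    """For each model, report the F3 gap to the top scorer and its relative cost."""
    _, leader_f3, leader_cost = max(board, key=lambda row: row[1])
    return [
        {
            "model": name,
            "gap_to_leader": round(leader_f3 - f3, 1),
            "cost_vs_leader": round(cost / leader_cost, 3),
        }
        for name, f3, cost in board
    ]

rows = compare_to_leader(BOARD)
for row in rows:
    print(row)
```

Sorting the same rows by cost instead of score makes the mismatch easy to see: the two most expensive entries sit at opposite ends of the score range.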

Why the expensive model can lose

Two reasons, and neither is a scandal.

The first is specific and documented. Fable 5 routes work it identifies as cybersecurity to an older, weaker model — so when you pay for the newest, most capable public model and ask it to find vulnerabilities, a meaningful share of the time you are quietly handed last year's model instead. Pointed at deliberately vulnerable code, it often returns a refusal rather than a finding. You are, in a real sense, paying a premium for the privilege of being declined.

The second is the deeper one. Raw model horsepower was never the same thing as finding vulnerabilities. Spotting that something looks off is the part big models are naturally good at, and it scales with size and price. But catching the real bugs across an entire codebase — methodically, run after run, without skipping the hard ones — is a matter of discipline more than brilliance, and discipline does not come with a higher token price. A smaller, cheaper, well-behaved model that works the whole problem can out-recall a larger one that is dazzling and erratic. Because we weight recall so heavily, that is exactly the model our board rewards.

What to actually do about it

Four things.

Stop treating the price tag as a quality signal for security work. On this evidence it isn't one. The most expensive tokens are sometimes the best and frequently not, and you cannot tell which from the per-million rate.

Decide what you are optimizing for before you choose. If a single missed vulnerability is an existential risk, buy the top of the board and pay for it. If you are scanning a large codebase continuously (every commit, every pull request, where cost multiplies by volume), a model that catches nearly as much for a fraction of the price is not a compromise; it is the correct operational answer. The cheapest credible models make "scan everything, all the time" financially trivial; the frontier flagships make it a budget meeting.
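A back-of-the-envelope sketch shows how volume changes the picture. It rests on loud assumptions: that one scan costs about as much as one average benchmark repository (the run cost divided by 26), and that a team runs 2,000 scans a month. Both numbers are hypothetical placeholders; real per-scan cost depends on repository size and how much of it each scan covers.

```python
REPOS_IN_BENCHMARK = 26  # from the benchmark description

def monthly_cost(benchmark_run_cost_usd: float, scans_per_month: int) -> float:
    """Rough monthly spend, ASSUMING one scan costs about one benchmark repo."""
    cost_per_scan = benchmark_run_cost_usd / REPOS_IN_BENCHMARK
    return cost_per_scan * scans_per_month

SCANS_PER_MONTH = 2_000  # hypothetical volume, not a measured figure

flagship = monthly_cost(66.0, SCANS_PER_MONTH)  # ~$66 run cost quoted above
budget = monthly_cost(1.0, SCANS_PER_MONTH)     # ~$1 run cost quoted above
print(f"flagship: ${flagship:,.0f}/month, budget: ${budget:,.0f}/month")
```

Swap in your own scan volume and per-scan cost; the ratio between the two models stays the same, but the absolute difference is what lands on the invoice.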

Test the model slot independently of everything else you've built. The harness, the prompts, the skills, the tool wiring — that is where most of your real engineering effort goes, and it is worth it. But the model dropped into that work is often the most expensive line on the bill and the easiest thing to swap, and on this evidence swapping down can cost you almost nothing in detection. You don't have to rebuild your stack to stop overpaying. You have to re-run it with a cheaper engine and check whether anything actually got worse.
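One way to check whether anything got worse is to diff the two runs against your own labeled set. The sketch below assumes you can reduce each run's findings to comparable identifiers. The identifiers and the function name are hypothetical, and matching findings in practice usually needs fuzzier rules than set equality.

```python
def swap_report(labeled: set[str], baseline_found: set[str],
                candidate_found: set[str]) -> dict:
    """Compare two model runs through the same harness against known vulnerabilities."""
    if not labeled:
        raise ValueError("need at least one labeled vulnerability")
    base_hits = baseline_found & labeled
    cand_hits = candidate_found & labeled
    return {
        "baseline_recall": len(base_hits) / len(labeled),
        "candidate_recall": len(cand_hits) / len(labeled),
        "lost": sorted(base_hits - cand_hits),    # caught before, missed now
        "gained": sorted(cand_hits - base_hits),  # missed before, caught now
    }

# Hypothetical example: four known bugs, one run per model.
labeled = {"V1", "V2", "V3", "V4"}
expensive_run = {"V1", "V2", "V3", "X9"}  # X9 is a false positive
cheap_run = {"V1", "V2", "V4"}

report = swap_report(labeled, expensive_run, cheap_run)
print(report)
```

The "lost" list matters more than the headline recall: two models can catch the same share of bugs while missing different ones, so review what dropped out before committing to the swap.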

And apply the one test that survives all marketing. When any vendor or lab shows you a security number, ask whether you can run it yourself and get the same result. If you can't, it isn't data — it's a billboard. The reason we built this in the open is so that, for once, the answer is yes.

The board is open. The data is open. The scoring is open. See where every model landed, and re-run any number yourself. What you do with the budget you were about to spend is up to you.

The leaderboard, every model's raw outputs, the labels, and the scoring engine are open under an MIT license at realvuln.com — re-run any number and you'll get the same result, including the rows where Kolega's own scanner loses. RealVuln is built by Kolega; Kolega's scanner is on the board. Scores are F3 over 26 Python repositories, 697 hand-labeled vulnerabilities, and 120 false-positive traps; recall and precision are shown alongside for every model. Run-cost figures are estimates from published per-token pricing as of June 2026 and exclude batch and prompt-caching discounts, which lower real-world cost further; Fable 5's cost is an estimate of roughly twice Claude Opus 4.8's measured cost on the same benchmark.
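As an illustration of how a run-cost estimate follows from per-token pricing, here is a minimal sketch. Only the $5 per million input-token rate comes from this article; the token counts and the output-token rate are hypothetical placeholders, and the benchmark's own cost calculation may differ in detail.

```python
def run_cost_usd(input_tokens: int, output_tokens: int,
                 input_rate_per_m: float, output_rate_per_m: float) -> float:
    """List-price cost of a run, with no batch or prompt-caching discounts applied."""
    return (input_tokens * input_rate_per_m
            + output_tokens * output_rate_per_m) / 1_000_000

# Hypothetical token counts and output rate, chosen only to show the arithmetic.
example = run_cost_usd(
    input_tokens=10_000_000,
    output_tokens=1_000_000,
    input_rate_per_m=5.0,    # published input rate quoted in this article
    output_rate_per_m=20.0,  # placeholder, not a published figure
)
print(f"${example:.2f}")
```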
