June 16, 2026
Every Security Tool Has an Accuracy Number. Not One of Them Means Anything.
A short, uncomfortable look at how an entire industry agreed to measure itself with numbers nobody can check.
Jost Faganel
Try this. Open the websites of the tools that are supposed to find the vulnerabilities in your code, and write down the accuracy figure each one leads with.
OpenAI's Aardvark: 92%. DryRun Security: 88%. Corgea: around 75% fix accuracy. Snyk: an AI trained on 25 million data-flow cases. Aikido: the broadest coverage and the fewest false positives. AISLE: twelve out of twelve zero-days in OpenSSL. Anthropic's Mythos: thousands of vulnerabilities, a model deemed too dangerous to release.
Now line them up next to each other and try to decide which tool is best.
You can't. Not "it's hard." You literally cannot, because not one of those numbers is measured against the same thing as any other. Different code. Different definitions of what counts as a vulnerability. Different scoring. And in almost every case, a test set the vendor selected and will not show you.
A 92% measured on one vendor's private benchmark and an 88% measured on another's are not two points on the same scale. They are two different companies each holding up a trophy from a competition they organized, refereed, and won. We have all been squinting at these trophies and nodding as if they tell us something.
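To make that concrete, here is a toy sketch in Python. Everything in it is invented for illustration: the scanner, the vulnerability classes, and both test sets are hypothetical, and none of it describes any real vendor's evaluation. It only shows that the same tool can post very different numbers depending on who picked the exam.

```python
# Hypothetical illustration: one scanner, two differently curated test sets.
# All names and data here are made up for the example.

def detection_rate(detects, test_set):
    """Fraction of labeled vulnerabilities in test_set that the scanner flags."""
    found = sum(1 for vuln_class in test_set if vuln_class in detects)
    return found / len(test_set)

# The vulnerability classes this imaginary scanner is able to flag.
scanner_detects = {"sql_injection", "xss"}

# Test set A: heavy on the classes the scanner happens to handle well.
test_set_a = ["sql_injection"] * 6 + ["xss"] * 3 + ["ssrf"] * 1

# Test set B: the same kinds of bugs, in a different mix.
test_set_b = (
    ["sql_injection"] * 2 + ["xss"] * 1 + ["ssrf"] * 4 + ["path_traversal"] * 3
)

rate_a = detection_rate(scanner_detects, test_set_a)
rate_b = detection_rate(scanner_detects, test_set_b)
print(f"set A: {rate_a:.0%}, set B: {rate_b:.0%}")  # set A: 90%, set B: 30%
```

Same scanner, same code path, a sixty-point swing. Neither figure is dishonest, and neither can be compared with anything unless the test set is published alongside it.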
They tell us nothing.
This is the strange, quiet bargain the security industry has made. Every vendor publishes a number. Every buyer knows, somewhere, that the number is marketing. And everyone proceeds anyway, because the alternative is admitting that the thing we most want to know about a security tool (does it actually catch what it's supposed to catch?) is a thing we have agreed not to measure honestly.
It is worth sitting with how odd that is. In most fields that matter, you can check the claim. A car's crash rating comes from a standard test anyone can look up. A drug's trial data is published and picked apart. But the software that decides whether the code running your bank, your hospital, your phone is safe to ship? That gets graded by the people selling it, on exams they wrote, and we file the results under "good enough."
None of this means the numbers are lies. Most of them are probably honest, in the narrow sense that the vendor really did measure what they say they measured. That is exactly the problem. Honest numbers that can't be compared are worse than useless, because they look like information. They give you the feeling of having checked without the fact of it.
So here is the test I would apply to any security vendor now, including the one I work for. Ask them a single question: can I run your number myself and get the same result?
If the answer is no, you do not have data. You have a billboard.
We got tired of squinting at billboards, so we built the thing that should have existed already. It's called RealVuln. It is an open benchmark for security scanners: real vulnerable code, vulnerabilities labeled by hand, traps planted to catch tools that bluff, and a scoring script anyone can run. All of it open, with the methodology written up in a peer-reviewable paper on arXiv so the design can be picked apart, not just trusted. We put two dozen scanners on it. We put our own scanner on it too, and we publish the metric where our own tool loses, because a benchmark you cannot lose on is just a billboard with extra steps.
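For a feel of what "a scoring script anyone can run" can look like, here is a minimal sketch. It is not RealVuln's actual scorer: the `Finding` type, the `score` function, the exact-match rule, and the sample data are all assumptions made up for this example. The idea is only that hand-labeled vulnerabilities, planted traps, and a scanner's report can be compared in a few lines that anyone can rerun.

```python
from dataclasses import dataclass

# Hypothetical sketch, not the real RealVuln scorer. A finding is matched
# only if file, line, and CWE all agree; real benchmarks may match more loosely.

@dataclass(frozen=True)
class Finding:
    path: str
    line: int
    cwe: str

def score(reported, labeled, traps):
    """Compare a scanner's report against hand labels and planted traps."""
    reported, labeled, traps = set(reported), set(labeled), set(traps)
    true_pos = reported & labeled      # real bugs the scanner found
    false_pos = reported - labeled     # anything flagged that is not a real bug
    trap_hits = reported & traps       # safe code planted to catch bluffing
    precision = len(true_pos) / len(reported) if reported else 0.0
    recall = len(true_pos) / len(labeled) if labeled else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positives": len(false_pos),
        "trap_hits": len(trap_hits),
    }

# Invented sample data: four labeled bugs and two traps.
labeled = {
    Finding("app/db.py", 42, "CWE-89"),
    Finding("app/views.py", 17, "CWE-79"),
    Finding("app/fetch.py", 8, "CWE-918"),
    Finding("app/files.py", 30, "CWE-22"),
}
traps = {
    Finding("app/db.py", 90, "CWE-89"),     # parameterized query, looks risky
    Finding("app/views.py", 55, "CWE-79"),  # output is already escaped
}
reported = [
    Finding("app/db.py", 42, "CWE-89"),
    Finding("app/views.py", 17, "CWE-79"),
    Finding("app/fetch.py", 8, "CWE-918"),
    Finding("app/db.py", 90, "CWE-89"),     # fell for a trap
    Finding("app/util.py", 3, "CWE-78"),    # not a labeled bug
]

result = score(reported, labeled, traps)
print(result)
```

In this made-up run the scanner finds three of four real bugs, reports five findings in total, and walks into one trap. Whether a trap hit should cost more than an ordinary false positive is exactly the kind of scoring choice that deserves to be argued about in public.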
It is not finished and it is not perfect. It is Python only for now, the scoring choices are debatable, and we say so in the open where you can argue with us. That is the point. You are allowed to check our work. You are encouraged to. If we are wrong, the repository takes pull requests.
And the invitation goes to everyone with a number. The scanners that aren't on the board yet: come put your accuracy claim somewhere a buyer can verify it. The ones already on it: re-run it and keep us honest. And the labs sitting on the most powerful security models of all, the ones locked behind a vetted-access list, Anthropic's Mythos most of all: your capability currently exists in a press release and a system card. Run it in the open and it becomes a number the rest of us can actually trust.
Nobody is obligated to accept. But from this week forward, when a security tool shows you a figure and cannot show you how to reproduce it, you are allowed to ask why. And you are allowed to notice who goes quiet.
The board is open. The data is open. The scoring is open. The only thing left to find out is who is willing to be measured.
You can see it, run it, and try to break it at realvuln.com.
RealVuln is an open-source benchmark for security scanners, built by Kolega, with its methodology published in a paper on arXiv. Kolega's scanner is on the board, and the benchmark publishes the metrics where it loses. Every number above is real and linked to its source; they are cited as examples of claims you cannot independently verify, not as claims that are false. The data and the scorer are at realvuln.com.