June 12, 2026
An Open Invitation to Anthropic: Fable 5 Came 10th on RealVuln. Help Us Put Mythos 5 on the Board.
Jost Faganel
We put Claude Fable 5 — the public version of "the world's most dangerous AI" — on an open security benchmark. It landed mid-pack. But the model doing the headline work, Mythos 5, is locked away. So we're inviting Anthropic to put it on the board.
This week Anthropic shipped Claude Fable 5: the publicly available version of Mythos, the model the company spent the spring describing as nearly too dangerous to release. We run RealVuln, an open-source benchmark for security scanners, so we did the obvious thing. We added Fable 5 to the board and pointed it at 26 real vulnerable codebases containing 697 hand-labeled vulnerabilities.
It came 10th. Out of 24.
Its F3 score was 50.5. It placed behind GPT-5.5 (60.2), GLM-5.1 (57.1), DeepSeek V4 Flash (56.5, and about a dollar to run), Kimi K2.6, and Anthropic's own older, cheaper Claude Opus 4.8 (53.6). The most-hyped security model of the year landed in the middle of the pack, behind models that cost a fraction of its price.
Before anyone assumes we cooked the numbers: we're a vendor. Our own scanner is on this leaderboard, and we say so in the paper. Every label, every raw output file, the scoring engine, and the full methodology are open source under an MIT license. You don't have to take our word for the 50.5 — you can re-run it yourself and get the same number. That's the entire reason we built the benchmark in the open, and it's about to matter more than the score.
Why a frontier model lands mid-pack
Two things are happening here, and neither is a conspiracy.
The first is something Anthropic documents itself. Fable 5 runs classifiers that detect cybersecurity work and route it to the older, weaker Claude Opus 4.8. So when you ask the "most powerful public model" to do security, a meaningful share of the time you are not getting the most powerful public model. You are getting last year's.
We saw this directly: running Fable against intentionally-vulnerable code through the standard API path returned refusals rather than findings, which is consistent with the documented routing. And we are not unusual in hitting it. Anthropic's own system card reports Fable 5 falling back to safety refusals far more often than the company's headline figure — 20.9% of trials on its own Terminal-Bench evaluation, against a claimed rate under 5%. Security practitioners have reported the same pattern with routine defensive work: not just offensive requests being blocked, but ordinary code review and security tasks getting quietly down-routed to the weaker model. The guardrails, in other words, do not only stop offensive use. They stop a lot of defensive use too.
The second thing is the deeper finding, and it has nothing to do with guardrails: raw model horsepower was never the same as scanner performance. Detection — spotting that something looks wrong — is the part large language models are naturally good at. The harder, less glamorous parts are what separate a research demo from a tool you would put in a production pipeline: following instructions every single time, keeping recall high without burying the team in false positives, and doing it reproducibly, run after run. Those are the things a benchmark measures and a launch video does not. (We will dig into that distinction in a companion piece, because it is the more important story.)
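To make the trade-off between recall and false positives concrete, here is a minimal sketch. It assumes that "F3" means the standard F-beta score with beta = 3, reported on a 0 to 100 scale; this post does not define the metric, so treat that reading as an assumption. The two scanners and their counts are invented for illustration and are not results from the board.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 3.0) -> float:
    """Standard F-beta score on a 0-100 scale.

    beta > 1 weights recall above precision; beta = 1 is the usual F1.
    Assumption: this is what "F3" refers to; the post does not say.
    """
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    if denominator == 0:
        return 0.0
    return 100.0 * (1 + b2) * tp / denominator


TOTAL_LABELS = 697  # labeled vulnerabilities cited in the post

# Hypothetical scanner A: finds a lot, but buries it in false positives.
noisy_f3 = f_beta(tp=500, fp=2000, fn=TOTAL_LABELS - 500)

# Hypothetical scanner B: finds less, but almost everything it reports is real.
quiet_f3 = f_beta(tp=350, fp=50, fn=TOTAL_LABELS - 350)

print(f"noisy scanner F3: {noisy_f3:.1f}")
print(f"quiet scanner F3: {quiet_f3:.1f}")
```

With these made-up counts the noisy scanner scores about 57.0 and the quiet one about 52.5, because beta = 3 rewards recall heavily. Rerun the same counts with beta=1.0 and the order flips. The weighting is a design choice, which is one reason a single headline score never tells the whole story.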
The two-tier problem nobody is naming
Here is the part that should bother every security team, and it is not really about a leaderboard position.
The version of this model that does the headline-grabbing security work — Mythos 5 — is restricted to a small set of vetted organizations. The version the rest of us can pay for and deploy is configured to refuse or degrade a large fraction of legitimate defensive work. A small company trying to find the vulnerabilities in its own code is on the wrong side of that line. We would know. We applied for the gated access, and we do not have it.
You can hold an entirely reasonable view that frontier offensive-security capability should be gated. Reasonable people do. But the consequence is concrete: the most capable public security AI of the year is, for most defenders, less useful than a frontier model from a year ago — and there is still no transparent, reproducible way to see what any of these systems actually catch on real code. A press release is not that. A curated demo is not that. A benchmark you can run yourself is.
That gap is the whole reason an open benchmark exists. And it is why our response to all of this is an invitation rather than an attack.
An open invitation to Anthropic
There's one obvious gap in everything above: the model we scored is Fable 5, not Mythos 5. We scored Fable because Fable is what the public can actually run. Mythos 5 — the ungated version, the one doing the headline security work — isn't on our board for the simple reason that we can't access it.
So we'd like Anthropic to put it there.
RealVuln is open source — the dataset, the scoring engine, the prompts, the false-positive traps, and the version-locked manifests. The invitation is specific and genuine:
Run RealVuln on Mythos 5 — the real, ungated model — and add its result to the board. Use whatever harness you run in production. Submit it yourselves or send us the run, and we'll publish it verbatim, scored exactly like every other scanner, with your methodology documented in full.
This isn't a challenge to beat a number. It's an invitation to be measured on the same footing as everyone else — against the same 697 vulnerabilities, including the false-positive traps a model can't memorize its way past, scored by the same open pipeline. Right now Mythos 5's security capability exists only in a 319-page system card and a set of curated results. The board is where it becomes a number anyone can reproduce.
The same invitation stands for OpenAI's security model, for Google's, and for any vendor who wants their tool represented. That is the entire design. The benchmark does not ask anyone to trust it. It asks them to take part in it.
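For readers wondering how a false-positive trap enters a score, the sketch below shows one plausible way to count it. Everything here is hypothetical: the Location tuple, the exact-match rule, and the file names are illustrative assumptions, not RealVuln's actual label schema or matching logic, which live in the published repository.

```python
from typing import Iterable, NamedTuple


class Location(NamedTuple):
    """Hypothetical label key: a file path plus a CWE identifier."""
    file: str
    cwe: str


def score_findings(
    findings: Iterable[Location],
    vulnerabilities: set[Location],
    traps: set[Location],
) -> dict[str, int]:
    """Count true positives, false positives, and false negatives.

    A trap is code that looks vulnerable but is not, so a finding that
    lands on a trap is a false positive like any other wrong report.
    """
    reported = set(findings)
    return {
        "tp": len(reported & vulnerabilities),
        "fp": len(reported - vulnerabilities),
        "fn": len(vulnerabilities - reported),
        "trap_hits": len(reported & traps),
    }


# Illustrative labels only.
vulnerabilities = {
    Location("app/login.py", "CWE-89"),
    Location("app/upload.py", "CWE-22"),
}
traps = {Location("app/search.py", "CWE-89")}  # looks injectable, is not

findings = [
    Location("app/login.py", "CWE-89"),   # real vulnerability
    Location("app/search.py", "CWE-89"),  # trap
    Location("app/util.py", "CWE-798"),   # unlabeled noise
]

counts = score_findings(findings, vulnerabilities, traps)
print(counts)
```

A scanner that flags anything resembling a known vulnerable pattern will hit the trap and pay for it in precision. That is the sense in which traps resist memorization.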
The reason to want this
There is a reason to be skeptical of frontier-model security claims that has nothing to do with us.
Earlier this year, through the Linux Foundation, Anthropic's Mythos was run against curl, one of the most heavily audited C codebases in existence, fuzzed and scanned for years. It reported five vulnerabilities. After triage, one low-severity issue remained, headed for a "severity: low" CVE. Zero memory-safety vulnerabilities.
curl's creator, Daniel Stenberg — who has more reason than almost anyone to want better security tooling, and who credits modern AI analyzers as genuinely better than traditional static tools — concluded that the hype around the model had been, in his words, "primarily marketing." That is his verdict, on his own project. In fairness, and it matters, curl is a brutally hard target precisely because it is so well audited; the one-bug result is not proof the model is useless. But the gap between the marketing and the measured result was wide enough that the maintainer of the world's most-deployed transfer library reached for that phrase. It is worth remembering the next time a security capability arrives by press release.
The honest posture — ours very much included — is to let people check the work. We publish the metrics where our own scanner loses. The only thing that makes any security number worth trusting, whether it comes from a four-person startup or a trillion-dollar lab, is whether you can run it yourself and get the same answer.
You can run ours. The leaderboard, the full Fable 5 deep-dive with every caveat, and the standing invitation to Anthropic are all at realvuln.com.
RealVuln is an open-source benchmark for security scanners, built by Kolega.dev. Kolega is one of the 24 scanners on the board, and the benchmark publishes the metrics where Kolega loses; every result is reproducible from the published files. Fable 5's run cost is an estimate (twice Claude Opus 4.8's measured cost on the same benchmark, matching its published API pricing); all other figures are measured. Full methodology and per-run details are on the dashboard.