The Noisy Neighbor
Imagine a SecOps team on a Monday morning. The on-call engineer opens the queue and finds 271 new candidate findings from the weekend’s automated scan. Each one is a plausible memory-safety issue. Each one is real in the sense that the code path exists. None of them comes with a determination of whether it is reachable from an external attacker, exploitable under the deployed configuration, or material to the business.
The engineer has eight hours of working time and a P1 ticket from another system that needs four of them. Spread the remaining four hours across 271 findings and the triage budget works out to roughly 53 seconds per finding - barely enough time for a human to read the ticket’s title, let alone determine whether it’s a false positive or a company-ending event.
This is the noisy neighbor problem in AI-driven security. The discovery layer has become an order of magnitude cheaper and faster. The verification layer - the work that determines which candidates are actually vulnerabilities in a specific environment - has not. Mozilla worked through 271 findings and shipped fixes for all of them in Firefox 150 (Mozilla, 2026); they have a mature security organisation that scaled to the load. The harder question for the rest of the industry is what happens when discovery rates double, then double again. When every codebase under management gets scanned weekly. When five vendors are offering this capability and the aggregate finding rate across an enterprise’s portfolio runs into the thousands per month. The answer cannot be that every organisation grows a Mozilla-grade security org; that is not how the maths works.
Part 1 of this series argued that the architectural response to AI-driven discovery has to be as autonomous as the discovery itself. This post is about the specific layer where that autonomy has to land: exploitability testing.
A Finding Is Not a Vulnerability
The first conceptual move that fixes the noisy neighbor problem is to stop using the words “finding” and “vulnerability” interchangeably. They are not the same thing.
A finding is a candidate - a code path, behaviour, or configuration that an analysis tool has flagged as suspicious. A vulnerability is a finding that has been shown to be reachable by an attacker, exploitable under the deployed configuration, and material in its impact. Most AI security tooling - and most of the marketing around it - collapses these into one word, which is how 271 findings becomes “271 vulnerabilities” in a press cycle and how a SecOps team without Mozilla-grade engineering capacity ends up with a backlog they can’t work through at speed.
The honest funnel from finding to actionable vulnerability has four stages, and each stage typically drops the candidate count by roughly an order of magnitude (a minimal code sketch of the gate sequence follows the list):
- Discovery: The model surfaces a candidate. This is the cheap step.
- Reachability: Can an attacker actually reach the code path under the deployed configuration? A buffer overflow in a parser only matters if untrusted input reaches that parser. A use-after-free in a privileged daemon only matters if an unprivileged process can trigger the relevant state.
- Exploitability: Given that the path is reachable, can the candidate finding be turned into a working primitive - a crash that yields information disclosure, a memory corruption that yields code execution, an authentication bypass that yields privilege escalation? Most candidate findings cannot, because of mitigations the discovery layer doesn’t fully model: stack canaries, ASLR, sandboxing, control-flow integrity, modern allocator hardening.
- Impact: Given that the finding is exploitable, what does an attacker actually gain? Code execution in a sandboxed renderer is materially different from code execution in a domain controller. Both are vulnerabilities; only one is a P1.
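To make the distinction concrete, here is a minimal sketch of the data model the funnel implies. The stage names mirror the list above; the `Finding` class, its fields, and the gate mechanics are illustrative, not a reference to any particular tool’s schema.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class Stage(Enum):
    DISCOVERY = auto()       # a tool flagged the code path
    REACHABILITY = auto()    # untrusted input can reach it as deployed
    EXPLOITABILITY = auto()  # it can be turned into a working primitive
    IMPACT = auto()          # what the attacker gains is material

@dataclass
class Finding:
    """A candidate surfaced by the discovery layer - not yet a vulnerability."""
    identifier: str
    description: str
    passed: set[Stage] = field(default_factory=lambda: {Stage.DISCOVERY})

    def is_vulnerability(self) -> bool:
        # The word is only earned once every downstream gate has been cleared.
        return self.passed == set(Stage)

# Example: a parser overflow that is reachable but never shown to be exploitable.
candidate = Finding("FND-0042", "heap overflow in tag parser")  # illustrative ID
candidate.passed.add(Stage.REACHABILITY)
print(candidate.is_vulnerability())  # False
```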
Put in those terms, Mythos is a world-class scout, but it’s still a junior engineer when it comes to the paperwork of turning a finding into a CVE.
The 4.4% number that Davi Ottenheimer’s read at Flying Penguin (2026) extracted from Anthropic’s own system card is the funnel in microcosm. Anthropic’s headline 72.4% exploitation rate on the Firefox 147 corpus collapses to 4.4% when two specific bugs are removed from the test set. The model wasn’t generally exploiting a corpus of fifty bugs; it was exploiting two well-suited bugs across many trials, and the headline number was a property of the corpus rather than a property of the model. That is not an indictment of Mythos as a capability. It is a worked demonstration of how a discovery-stage number, presented without the exploitability-stage qualification, materially misrepresents the operational picture.
The 271-to-3 ratio in the Firefox 150 advisory is the same funnel running on real-world data, and the specifics matter. Of the 271 findings Claude surfaced in the Mozilla collaboration, three were credited to Claude as standalone CVEs in the formal advisory: CVE-2026-6746 (High), CVE-2026-6757 (Medium), and CVE-2026-6758 (Medium) (Mozilla, 2026; SecurityWeek, 2026). Firefox 150 itself addressed more than 40 CVEs in total, drawn from multiple sources of inbound vulnerability research (SecurityWeek, 2026).
That works out to roughly 99% of Claude’s findings not reaching the standalone-CVE bar. Note the careful phrasing: not “99% non-exploitable,” because some of the remaining 268 were almost certainly folded into Firefox 150’s aggregate memory-safety CVEs. The honest read is that roughly 99% of what the discovery layer surfaced fell short of the threshold at which Mozilla deemed an individual public advisory entry warranted - defence-in-depth fixes, hardening, lower-severity issues, and bugs in code paths where the impact stage of the funnel doesn’t justify standalone disclosure. The findings are real. The defects are real. The model’s contribution is real. They just sit at different points in the funnel from the headline number, and an enterprise that treats the headline as the operationally relevant figure is committing to triage capacity it doesn’t have.
Scaling the Verification Layer
The hard part is not building a verification layer. The hard part is building one that scales linearly with discovery rates.
Discovery scales with token spend. If a vendor wants to find ten times as many bugs next quarter, they buy ten times the model capacity and point it at a wider corpus. The finding rate scales with the budget.
Verification, done by hand, scales with engineers. There is no realistic scenario in which an enterprise hires verification engineers at the same rate AI vendors expand discovery capacity. The gap is structural. Either verification gets automated to the same degree discovery has, or the verification layer becomes the limiting factor on the entire pipeline - at which point the rational response is to scan less, which is the wrong answer.
A verification layer that scales has to do four things, and the engineering details on each one matter more than most teams realise.
1. Deterministic reproduction.
AI-surfaced findings are stochastic. The model that produced the candidate may not produce the same candidate twice, and the candidate’s description may underspecify the conditions needed to trigger it. The first stage of the pipeline takes the AI’s natural-language finding and converts it into a reproducible test case: an input, a state, a sequence of operations that reliably produces the flagged behaviour. If the candidate cannot be reduced to a deterministic reproduction, it cannot be validated and it cannot be fixed; it goes to a separate review queue with its own SLA. This stage typically eliminates a meaningful fraction of AI-surfaced candidates before any sandbox runs, because the candidate description doesn’t survive contact with the actual code path. Crucially, this stage scales: it is itself an automatable transformation, and it keeps the downstream stages from being flooded with findings that can’t be acted on anyway.
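As a sketch of what that gate can look like in practice - the target command, payload, and trial count below are placeholders, and a real harness would also normalise crash signatures rather than compare bare exit codes:

```python
import subprocess
from dataclasses import dataclass

@dataclass
class ReproResult:
    deterministic: bool
    outcomes: list[int]  # exit codes per trial; negative values mean killed by signal

def reproduce(command: list[str], payload: bytes, trials: int = 5,
              timeout_s: int = 30) -> ReproResult:
    """Run the candidate's claimed trigger repeatedly; only consistent behaviour counts."""
    outcomes = []
    for _ in range(trials):
        try:
            proc = subprocess.run(command, input=payload,
                                  capture_output=True, timeout=timeout_s)
            outcomes.append(proc.returncode)
        except subprocess.TimeoutExpired:
            outcomes.append(999)  # treat hangs as their own outcome class
    # "Deterministic" here means every trial agreed, and the agreed outcome is the
    # abnormal one the finding claims (any non-zero exit in this toy version).
    deterministic = len(set(outcomes)) == 1 and outcomes[0] != 0
    return ReproResult(deterministic, outcomes)

# Placeholder target and payload - not a real reproduction, just the harness shape.
result = reproduce(["false"], b"trigger-bytes")
print(result.deterministic, result.outcomes)
```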
2. Production-shaped sandboxing.
The candidate then runs against an instrumented replica of the production environment, with the production defences in place. This is the part most homegrown attempts get wrong: testing the candidate in a stripped-down lab tells you whether the code path exists, not whether it is exploitable in the deployed configuration.
Anthropic’s own Firefox 147 demonstration is the cautionary tale. Davi Ottenheimer’s read at Flying Penguin (2026) documents that the model’s headline 72.4% exploitation rate was achieved against a SpiderMonkey JavaScript engine shell in a container - with the browser’s process sandbox stripped out, and several other defence-in-depth mitigations removed to make the test tractable. That is a lab condition. It is not the Firefox a real user runs. An exploit that works against an unsandboxed shell is materially different from one that has to defeat the deployed browser’s process isolation, ASLR, control-flow integrity, allocator hardening, and seccomp policies, and conflating the two produces capability claims that don’t survive a careful read.
The architectural implication is direct: a validation environment is only useful if it mirrors the production hardening surface. The defences that have to be modelled, at minimum, are the ones an attacker would actually face - ASLR and KASLR, stack canaries, CFI, sandboxing boundaries (process sandboxes, seccomp filters, namespaces, capability drops), modern allocator hardening, language-runtime mitigations, and the network and authentication topology that determines reachability in the first place. A finding that looks exploitable against a stripped-down rig and impossible against a hardened replica is not the same finding; the second answer is the operationally relevant one.
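A hedged sketch of what “mirror the production hardening surface” means operationally on a Linux host. The baseline values here are illustrative; in a real pipeline the expected map would be generated from the same IaC and policy bundles that build production:

```python
from pathlib import Path

# Illustrative baseline only - in practice this map is generated from the same
# IaC/policy bundles that produce production, not hand-written.
EXPECTED = {
    "/proc/sys/kernel/randomize_va_space": "2",    # full ASLR
    "/proc/sys/kernel/yama/ptrace_scope": "1",     # restricted ptrace
    "/proc/sys/kernel/kptr_restrict": "1",         # hide kernel pointers
}

def hardening_gaps(expected: dict[str, str]) -> list[str]:
    """Return every defence whose live value differs from the production baseline."""
    gaps = []
    for path, want in expected.items():
        node = Path(path)
        got = node.read_text().strip() if node.exists() else "<absent>"
        if got != want:
            gaps.append(f"{path}: expected {want}, got {got}")
    return gaps

def seccomp_mode() -> str:
    """Read the seccomp state of the current process from /proc/self/status."""
    for line in Path("/proc/self/status").read_text().splitlines():
        if line.startswith("Seccomp:"):
            return line.split(":", 1)[1].strip()  # 0 = off, 1 = strict, 2 = filter
    return "unknown"

if __name__ == "__main__":
    gaps = hardening_gaps(EXPECTED)
    print("seccomp mode:", seccomp_mode())
    # An exploitability verdict from an environment with gaps is not a
    # production-relevant verdict; fail the run rather than report on it.
    if gaps:
        raise SystemExit("sandbox does not match production baseline:\n" + "\n".join(gaps))
```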
3. Primitive construction, not just crashes.
A crash is not an exploit. The next stage of the pipeline attempts to convert the candidate from a reproducible crash into an exploit primitive - a usable capability an attacker can actually build with. The taxonomy matters: information disclosure (a leak that defeats ASLR), arbitrary read or write (the basis of memory-corruption chains), control of an instruction pointer (the gateway to code execution), or a usable code-execution gadget chain on the deployed allocator and CFI policy. A reproducible crash that cannot be turned into one of these is, at most, a denial-of-service primitive. A finding that can be turned into one of these is a different operational class entirely.
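A minimal sketch of that taxonomy as code. The substrings matched below loosely resemble AddressSanitizer report phrasing, but the mapping is illustrative and deliberately defaults to “no primitive” - real triage parses the full report plus register and memory state:

```python
from enum import Enum

class Primitive(Enum):
    NONE = "no usable primitive (at most denial of service)"
    INFO_LEAK = "information disclosure (e.g. an ASLR-defeating leak)"
    OOB_READ = "out-of-bounds / attacker-influenced read"
    OOB_WRITE = "out-of-bounds / attacker-influenced write"
    PC_CONTROL = "control of an instruction pointer"

def classify(report: str) -> Primitive:
    """Toy triage of a crash report into the primitive taxonomy.

    The substrings below are illustrative, not a parser for any real tool's
    output, and the fallback is deliberately NONE when evidence is ambiguous.
    """
    text = report.lower()
    if "pc points to" in text or "jump to" in text:
        return Primitive.PC_CONTROL
    if "write of size" in text:
        return Primitive.OOB_WRITE
    if "read of size" in text:
        return Primitive.OOB_READ
    if "uninitialized" in text or "leak" in text:
        return Primitive.INFO_LEAK
    return Primitive.NONE

print(classify("AddressSanitizer: heap-buffer-overflow ... WRITE of size 8"))
# Primitive.OOB_WRITE - still several stages short of a working exploit chain
```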
This distinction is where the capability frontier of AI security tooling actually lives, and the published evidence is starting to map it.
At the macro level, UK AISI’s evaluation of Mythos against the 32-step “Last Ones” benchmark is the strongest independent signal that frontier-class capability matters for end-to-end attack chains. Mythos completed the full chain in 3 of 10 attempts and averaged 22 of 32 steps; the next-best model (Claude Opus 4.6) averaged 16 of 32 (UK AISI, 2026). That gap doesn’t separate the discovery stage - most of those 32 steps are the harder reasoning that comes after. It separates everything that requires sustained multi-step planning under environmental constraints, which is exactly the territory of primitive construction.
At the level of the primitive itself, AISLE’s reproduction work draws the line more sharply. AISLE (2026) found that 8 of 8 models they tested - including a 3.6-billion-parameter open-weights model running at $0.11 per million tokens - could detect the FreeBSD NFS bug Anthropic showcased as a flagship demonstration. Detection, in other words, has commoditised. The gap that hasn’t commoditised, in AISLE’s framing, is the gap on primitive construction: bypasses for modern allocator hardening, multi-stage gadget chains, exploitation paths that require novel mathematical reasoning, and any chain whose construction depends on holding several constraints in working memory at once. On those tasks, frontier-class models are still meaningfully ahead of the open-weights ecosystem, and the gap is wide enough to justify the price difference.
The procurement implication is direct, and it inverts how most enterprises currently spend on AI security tooling. The right architecture uses commodity, open-weights models for the discovery stage - where capability has flattened and Mythos pricing buys very little incremental detection - and reserves frontier-priced model usage for the primitive-construction stage of validation, where the capability gap is real and the work is genuinely hard. Spending Mythos pricing on bulk detection is overpaying for a capability the open-weights ecosystem delivers for two orders of magnitude less. Spending it on primitive construction, where the gap actually exists, is rational.
The scaling property of the validation layer is what makes this allocation tractable. Ephemeral, parallel, IaC-driven sandbox environments mean a verification layer is throughput-limited by cloud capacity rather than by engineers - and frontier-model spend can be concentrated on the small fraction of findings that have already passed reproduction and reachability gates and warrant the deeper analysis.
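One way to express that allocation in the pipeline itself. The tier names and budget numbers below are placeholders, not vendor pricing; the point is that frontier-priced analysis is only reachable for candidates that have already cleared the reproduction and reachability gates:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    identifier: str
    reproduced: bool   # passed the deterministic-reproduction gate
    reachable: bool    # passed the reachability gate in the production-shaped replica

# Illustrative tiers and per-candidate cost ceilings - placeholder numbers only.
COMMODITY_TIER = {"tier": "open-weights", "budget_usd": 0.05}
FRONTIER_TIER = {"tier": "frontier", "budget_usd": 25.00}

def route(candidate: Candidate) -> dict:
    """Reserve frontier spend for the narrow bottom of the funnel."""
    if candidate.reproduced and candidate.reachable:
        return FRONTIER_TIER   # primitive construction is where the gap is real
    return COMMODITY_TIER      # detection and triage have commoditised

queue = [
    Candidate("FND-0001", reproduced=True, reachable=True),
    Candidate("FND-0002", reproduced=True, reachable=False),
    Candidate("FND-0003", reproduced=False, reachable=False),
]
frontier_count = sum(route(c)["tier"] == "frontier" for c in queue)
worst_case = sum(route(c)["budget_usd"] for c in queue)
print(f"frontier-priced candidates: {frontier_count}")
print(f"worst-case spend on this batch: ${worst_case:.2f}")
```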
4. Suppression with auditable reasoning.
This is the bit that breaks most attempts. Findings that fail validation cannot just disappear. The pipeline has to record what was tried, what failed, what assumption the suppression depends on, and what change in the environment would invalidate the suppression. Otherwise finding #5,847 gets dismissed by Monday’s on-call analyst, re-surfaced by the next scan, dismissed by Wednesday’s analyst on different reasoning, and the institutional memory of why it was dismissed twice never lands anywhere durable. Six months later a configuration change makes the finding exploitable, and nobody has the trail to know it. Validation pipelines that don’t generate auditable suppression trails produce worse outcomes than no validation pipeline at all, because they create false confidence on top of forgotten reasoning. At scale, this stage is what prevents the suppression backlog from quietly becoming the largest unmanaged source of risk in the security programme.
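A minimal sketch of what a suppression record has to carry for that to work. The field names are illustrative; the essential parts are the explicit environmental assumptions and a mechanical check for when they stop holding:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Suppression:
    """Why a finding was not promoted to an alert - and what would reopen it."""
    finding_id: str
    attempted: tuple[str, ...]   # what the validation pipeline tried
    failed_because: str          # why exploitation failed
    depends_on: dict             # environmental assumptions, as key/value facts
    suppressed_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

def should_resurface(s: Suppression, current_env: dict) -> bool:
    """Reopen the finding if any assumption the suppression rests on no longer holds."""
    return any(current_env.get(k) != v for k, v in s.depends_on.items())

# Illustrative record - the finding ID and environment facts are placeholders.
s = Suppression(
    finding_id="FND-5847",
    attempted=("deterministic repro", "heap groom against deployed allocator"),
    failed_because="write primitive blocked by allocator hardening",
    depends_on={"allocator": "hardened-malloc", "sandbox": "seccomp-strict"},
)
# Six months later: a configuration change swaps the allocator.
print(should_resurface(s, {"allocator": "glibc-default", "sandbox": "seccomp-strict"}))  # True
```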
The cumulative effect, when this is wired correctly, is that a high-volume discovery output produces a much smaller number of validated, environment-specific, prioritised alerts that reach SecOps - each one carrying a deterministic reproduction, a validated exploit primitive against the production-shaped environment, and an impact assessment grounded in the actual deployed configuration. That is what an AI-augmented security pipeline is supposed to deliver. Most of what’s currently being sold under that label delivers raw discovery output and calls the rest a customer problem.
Sakura Proof-Point: A Reference Architecture for Exploitability Validation
The architecture this points to is what we call a Sakura Proof-Point - the deterministic exploitability proof that has to sit between every AI-surfaced finding and the alert that reaches SecOps. Before any alert hits your desk, it must pass a Proof-Point: a reproducible, environment-specific, audit-trailed determination that the candidate finding is actually exploitable against the deployed defences. Anything that cannot pass that bar does not become an alert; it becomes a logged suppression with auditable reasoning.
This is a reference design we are building toward with clients, not a productised offering as of today. It is best understood as a discipline rather than a product: a stack that AI-surfaced findings flow through before any alert reaches SecOps, designed to convert raw discovery output into validated, environment-specific, prioritised remediation work. The point is to turn alert fatigue into automated remediation, and the architectural choices below are the ones we think actually deliver a Proof-Point at scale.
- IaC-synchronisation, not snapshots: A validation environment that drifts from production produces exploitability determinations that don’t generalise; six weeks of drift is enough to invalidate most of the runtime-defence assumptions the validation layer is making. The Proof-Point environment is built from the same Terraform, Helm charts, Kustomize overlays, and policy bundles (OPA, Kyverno) that produce production. Every production release rebuilds the Proof-Point environment from current IaC; any drift between the two is a CI failure rather than a tolerated inconvenience. The synchronisation is not a process - it is a build dependency. A minimal sketch of the drift gate follows this list.
- Ephemeral, eBPF-instrumented sandboxes: Each candidate finding gets its own short-lived sandbox instance, instrumented at the kernel level rather than at the application level. eBPF programs collect syscall traces, network connection attempts, memory-allocation patterns, and filesystem access; userspace tooling layers on memory-corruption detection (ASan, MSan, HWASan as appropriate to the workload), control-flow integrity checks, and seccomp-policy enforcement traces. The instrumentation is rich enough to distinguish a contained crash from a primitive that escapes the deployed defences. Sandboxes are spawned in parallel and torn down on completion; throughput scales with cloud capacity rather than with engineers, which is what makes the layer keep up with discovery rates that scale with token spend.
- Immutable validation logs: Every validation run - whether the candidate exploited successfully, failed against the production-shaped defences, or produced an undetermined result - writes an immutable log entry tied to the finding’s identity. The log records the candidate, the reproduction, the defences in place at the time of validation, the primitive attempted, the outcome, and the suppression reasoning if applicable. Logs are append-only, tamper-evident, and queryable across staff turnover. The investment is mostly in the schema and the tooling around it - the storage cost is negligible. The institutional-memory cost of not having this is, in our experience, the single largest source of recurring verification work in mature security programmes. A minimal hash-chaining sketch follows this list.
- Bidirectional integration with the remediation pipeline: Validated exploitable findings don’t just generate SecOps alerts; they generate candidate fixes that flow through the same automated remediation pipeline described in Part 1. The validation log carries forward into the fix’s regression-test suite, so the same finding cannot recur silently after a future code change. Findings that fail validation today can re-surface automatically if a future configuration change invalidates the suppression’s underlying assumption - the immutable log makes that re-surfacing detectable rather than dependent on a human remembering.
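Two illustrative sketches of the pieces above. First, the drift-as-CI-failure gate: `terraform plan -detailed-exitcode` exits with code 2 when live state differs from the configuration, but the module path and the wiring around it here are placeholders:

```python
import subprocess
import sys

def proofpoint_drift_gate(workdir: str) -> int:
    """Fail the pipeline if the Proof-Point environment has drifted from its IaC.

    `terraform plan -detailed-exitcode` exits 0 when there is nothing to change,
    1 on error, and 2 when the live state differs from the configuration.
    """
    plan = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir, capture_output=True, text=True,
    )
    if plan.returncode == 2:
        print("DRIFT: validation environment no longer matches production IaC")
        print(plan.stdout)
    return plan.returncode

if __name__ == "__main__":
    # Treat drift (exit 2) and errors (exit 1) identically: no exploitability
    # verdicts are trusted until the environment is rebuilt from current IaC.
    sys.exit(proofpoint_drift_gate("./environments/proofpoint"))  # placeholder path
```

Second, tamper-evidence for the validation log can be as simple as chaining each record to the hash of its predecessor. This in-memory sketch is not a substitute for a real append-only store, and the record fields are only a subset of the ones listed above:

```python
import hashlib
import json

class ValidationLog:
    """Append-only, hash-chained validation records (a minimal in-memory sketch)."""

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def append(self, record: dict) -> dict:
        prev_hash = self.entries[-1]["entry_hash"] if self.entries else "genesis"
        body = dict(record, prev_hash=prev_hash)
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        entry = dict(body, entry_hash=digest)
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash; any edited or removed entry breaks the chain."""
        prev = "genesis"
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "entry_hash"}
            recomputed = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
            if body["prev_hash"] != prev or recomputed != entry["entry_hash"]:
                return False
            prev = entry["entry_hash"]
        return True

# Illustrative records - finding IDs and fields are placeholders.
log = ValidationLog()
log.append({"finding": "FND-5847", "outcome": "suppressed",
            "defences": ["ASLR", "seccomp"], "reasoning": "primitive blocked by allocator"})
log.append({"finding": "FND-6001", "outcome": "exploitable", "primitive": "arbitrary write"})
print(log.verify())  # True; edit any past entry and this returns False
```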
The architecture is deliberately boring in its pieces - IaC, eBPF, immutable logs, automated pipelines - and deliberately ambitious in how those pieces are wired together. That is the right direction of travel. The interesting work in security automation right now is not in any individual component; it is in the integration discipline that turns discovery output into remediation throughput without flooding human attention.
The point of all this is not that human analysts are too slow. It is that human analysts working without a Proof-Point are forced to make exploitability calls under time pressure, on findings whose deployed-environment context they cannot fully reconstruct in 53 seconds - and at machine-speed discovery rates they will be wrong in both directions. The Proof-Point is the gate that lets human attention land only on findings whose exploitability has already been independently established, with the context they need to make a good call.
Questions Every Security Team Should Be Asking AI Vendors
Some of the operational damage in the current cycle is being done by procurement decisions that treat all AI security tooling as equivalent. It is not. The vendors with mature exploitability validation built in deliver materially different operational outcomes from the vendors who ship a discovery layer and call the rest a customer problem.
A few questions that separate the two, drawn from our Signal-to-Noise checklist:
- Does the tool produce a deterministic reproduction for each finding, or only a natural-language description of the candidate?
- Are findings validated against a production-shaped environment with the deployed defences in place, or against a stripped-down test rig?
- What’s the false-positive rate after validation, and how is it measured?
- When a finding is suppressed, is there an auditable reasoning trail that survives staff turnover?
- Is there a mechanism to re-surface previously-suppressed findings if the deployed environment changes?
- How does the tool distinguish between a crash and an exploit primitive?
The best vendors will answer these questions readily and with specifics. The vendors selling discovery as if it were verification will answer them with adjacent claims about model size, throughput, or vendor-validated benchmarks that don’t address the operational question.
The Sakura Sky Position: Verification Is the Real Moat
The discovery layer is now an arms race won by token spend. The verification layer is where defensive value actually accrues - and where most enterprises are unprepared.
Our position, applied across our client engagements, has three parts:
- Treat verification as infrastructure, not as a workflow. A verification layer that depends on senior analysts making 53-second judgement calls is not a layer; it is a bottleneck wearing a uniform. The Sakura Proof-Point is the discipline that turns verification into infrastructure - IaC-synchronised sandboxes, eBPF-instrumented runtime, immutable validation logs, automated remediation integration - so that human attention lands only on findings that have been independently shown to be exploitable in the deployed environment.
- Allocate model spend along the funnel, not across it. AISLE’s reproduction work (AISLE, 2026) shows that detection has commoditised: small open-weights models match frontier models on the discovery stage at two orders of magnitude lower cost. The capability gap that justifies frontier-model pricing lives in primitive construction - allocator bypasses, multi-stage gadget chains, the reasoning-heavy work of converting a crash into a usable exploit. We architect client pipelines around that asymmetry: commodity models for the wide top of the funnel, frontier-priced reasoning reserved for the narrow bottom.
- Procurement is a verification decision, not a capability decision. The questions in the section above are the ones we apply on behalf of clients evaluating AI security tooling. Vendors that can answer them with specifics, recent numbers, and stable answers under repeated questioning are vendors building the verification layer. Vendors that can’t are selling raw discovery as if it were the product. The price difference between the two is largely the cost of the verification infrastructure the cheaper tools haven’t built - which is exactly the cost the client will absorb operationally if they procure on the discovery axis.
The summary, the line we end every architecture conversation with: verify before you scale. A discovery pipeline without a verification layer doesn’t make you faster - it makes you wronger, faster.
Bottom Line
Verification is now the operational bottleneck and the primary control plane for scalable AI security. Two things are true at the same time:
- AI-driven discovery is a real and durable capability, and the cost of finding bugs has fallen by an order of magnitude.
- And the cost of proving which findings are actually exploitable in a specific environment has not moved at all.
The teams that navigate the next eighteen months well will be the ones who recognise that the operational bottleneck is now exploitability testing, not discovery, and who invest in the verification layer with the same urgency that the press cycle has been demanding for the discovery layer.
The Sakura Proof-Point is the gate that converts that 53-second triage window into a comprehensive, automated determination - environment-specific, reproducible, audit-trailed, and integrated with the remediation pipeline rather than ending at an alert. That is what an AI-augmented security pipeline is supposed to deliver.
AISLE’s framing of the capability landscape is the line worth ending on: the moat in AI cybersecurity is the system, not the model (AISLE, 2026). Discovery is commoditising. The validation pipeline is where defensive value actually accrues. Build that, and your security posture is robust to whichever frontier model wins the next round of capability claims. Skip it, and you’ve bought a faster way to generate alerts you cannot triage.
Coming up next in The Mythos Ledger: Part 3 - The Glasswing Strategy. Project Glasswing as defensive coalition, the regulatory-capture critique, and how to build cloud architectures that consume real-time vulnerability intelligence regardless of which consortium controls the disclosure pipeline.
References
AISLE (2026) AI Cybersecurity After Mythos: The Jagged Frontier. Available at: https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jagged-frontier (Accessed: 25 April 2026).
Flying Penguin (2026) The Boy That Cried Mythos: Verification is Collapsing Trust in Anthropic. Available at: https://www.flyingpenguin.com/the-boy-that-cried-mythos-verification-is-collapsing-trust-in-anthropic/ (Accessed: 25 April 2026).
Mozilla (2026) The zero-days are numbered. Available at: https://blog.mozilla.org/en/privacy-security/ai-security-zero-day-vulnerabilities/ (Accessed: 25 April 2026).
SecurityWeek (2026) Claude Mythos Finds 271 Firefox Vulnerabilities. Available at: https://www.securityweek.com/claude-mythos-finds-271-firefox-vulnerabilities/ (Accessed: 25 April 2026).
UK AISI (2026) Our evaluation of Claude Mythos Preview’s cyber capabilities. Available at: https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities (Accessed: 25 April 2026).