Exploits at Scale: When Token Throughput Becomes the Bottleneck

Added Jan 19

LLM agents autonomously generated dozens of working exploits for a QuickJS zero-day under modern mitigations, with results scaling predictably with token spend. This suggests exploit development is primed for ‘industrialization,’ where token throughput supersedes human headcount as the bottleneck. The author urges public, large-scale evaluations on real targets and warns that while post-access tasks may be harder to automate, progress in SRE automation will signal when they are next.

Key Points

  • Exploit generation shows clear token-for-results scaling: agents autonomously produced many working exploits, with GPT-5.2 solving all scenarios tested.
  • Industrialization hinges on two ingredients: autonomous search across the solution space and automated, accurate verification—both well-suited to exploit development.
  • Current results likely generalize, but QuickJS is simpler than major browsers; nonetheless, the limiting factor increasingly appears to be token throughput, not human experts.
  • Post-access tasks (lateral movement, persistence, exfiltration) are harder to industrialize due to adversarial, irreversible consequences—SRE automation is a relevant ‘canary.’
  • Frontier labs and AI Security Institutes should publicly evaluate models on real, hardened targets with zero-days and large token budgets, moving beyond CTFs and synthetic tests.

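The two ingredients named above—autonomous search plus automated, accurate verification—amount to a generate-and-verify loop. The sketch below is illustrative only: the generator is a stub standing in for a model call, and the verifier checks a toy success criterion (a marker file appearing in a sandbox directory), not an actual exploitation goal; all names here are invented for the example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def verify(candidate_src: str, workdir: Path, marker: str = "pwned.txt") -> bool:
    """Run a candidate in a subprocess and check the success criterion:
    did it write the marker file? (Stand-in for a real sandboxed verifier.)"""
    target = workdir / marker
    target.unlink(missing_ok=True)
    subprocess.run([sys.executable, "-c", candidate_src], cwd=workdir,
                   capture_output=True, timeout=10)
    return target.exists()

def search(generate, budget: int, workdir: Path):
    """Spend up to `budget` attempts; return the first verified candidate.
    Token spend maps onto attempts: more budget, more of the space searched."""
    for attempt in range(budget):
        candidate = generate(attempt)  # in practice: an LLM call with feedback
        if verify(candidate, workdir):
            return candidate
    return None

def stub_generate(attempt: int) -> str:
    """Stub generator standing in for the model: only the third attempt works."""
    if attempt < 2:
        return "raise RuntimeError('exploit failed')"
    return "open('pwned.txt', 'w').write('ok')"

with tempfile.TemporaryDirectory() as d:
    winner = search(stub_generate, budget=5, workdir=Path(d))
    print("verified after retries:", winner is not None)
```

The point of the sketch is that both halves are mechanical: failed attempts cost only tokens and compute, and the verifier's answer is unambiguous, which is exactly what makes the loop amenable to scaling.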
Sentiment

The overall sentiment is one of concerned agreement. The community largely accepts that LLM-driven exploit generation is a real and significant development, but there is substantial debate about the magnitude of the threat and whether the specific experiments fully justify the industrialization framing. Security professionals take the findings seriously, while some push back on methodological details. The tone is technically engaged and substantive with genuine worry about implications, tempered by some optimism about defensive applications of the same technology.

In Agreement

  • LLMs are demonstrably capable of generating working exploits against real software with modern mitigations, and the quality of output depends heavily on proper agent harness design with verifiable success criteria.
  • The bottleneck for offensive cyber capability is shifting from human headcount to token throughput, representing a genuine industrialization of exploit generation.
  • The offense-defense asymmetry is fundamental and favors attackers: they need to find any single exploitable bug, while defenders must find and fix all of them.
  • Current models have not plateaued in capability; the real bottleneck is generating tasks hard enough to benchmark against.
  • The distinction between low-quality AI-generated bug reports and sophisticated agent-driven exploit generation is about the skill and tooling of the operator, not the underlying model capability.
  • Frontier labs and AI Safety Institutes should evaluate models against real hardened targets rather than toy benchmarks.
  • The combination of lowering barriers to software creation and lowering barriers to exploitation is an explosive and dangerous dynamic.

Opposed

  • The LLM is largely regurgitating known exploitation techniques from training data rather than demonstrating genuinely novel offensive capability.
  • QuickJS is a small, under-maintained C project with known unpatched problems; demonstrating exploits against it is far less impressive than exploiting hardened production software.
  • The sandbox scenario is less impressive than it sounds — the goal was merely writing a string to a file, not achieving arbitrary code execution or sandbox escape.
  • LLMs are symmetric tools that defenders can also use, running LLM Red Teams in CI analogous to fuzz testing, which could shift the balance back toward defense.
  • Defenders have structural advantages attackers lack: access to source code, design docs, threat models, and the ability to redesign architecture.
  • Software and hardware have been getting structurally more secure over time, and memory-safe languages will raise the bar against exploitation faster than LLMs lower it.
  • The exploits did not demonstrate novel breaks in any protection mechanism — they used the same known gaps that human exploit developers use.
  • Writing an exploit from a known vulnerability is the easier part; the truly hard challenge is finding exploitable vulnerabilities in the first place.
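The ‘LLM Red Teams in CI’ point above is essentially the fuzzing workflow with a model in the generator seat: propose candidates against the changed code, verify findings reproducibly, and gate the build on confirmed results. A minimal hypothetical gate follows; the function names and finding format are invented, and a real setup would invoke a model and rerun proofs of concept in a sandbox rather than use these stubs.

```python
def red_team_findings(changed_files):
    """Stand-in for an agent that proposes candidate PoCs per changed file.
    A real pipeline would call a model here; this returns canned findings."""
    return [{"file": f, "poc": b"A" * 64} for f in changed_files]

def reproduces(finding) -> bool:
    """Stand-in verifier: rerun the PoC against the built target and report
    whether it actually fails (the 'accurate verification' ingredient)."""
    return False  # in this stub, nothing reproduces

def ci_gate(changed_files) -> int:
    """Fail the pipeline (nonzero exit) only on verified findings, so noisy
    unconfirmed reports never block a build."""
    confirmed = [f for f in red_team_findings(changed_files) if reproduces(f)]
    for f in confirmed:
        print(f"red-team: verified finding in {f['file']}")
    return 1 if confirmed else 0

exit_code = ci_gate(["src/parser.c"])
print("gate exit code:", exit_code)
```

Gating only on reproducible findings mirrors how fuzzing is deployed in CI, and is what would let defenders use the same token-for-results scaling without drowning in unverified reports.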