GPT-5.5: A Step Change in AI-Powered Hacking

XBOW's evaluation of GPT-5.5 reports a large jump in hacking capability, with the model's vulnerability miss rate falling to just 10%. Its blackbox performance now exceeds the whitebox results of previous models, effectively maxing out existing security benchmarks. Combined with gains in speed and strategic decision-making, this makes GPT-5.5 a practical and powerful tool for automated penetration testing.

Key Points

GPT-5.5 achieves a 10% miss rate on vulnerability benchmarks, a drastic improvement over GPT-5's 40% and Opus 4.6's 18%.
The model's blackbox testing capabilities now outperform the whitebox capabilities of previous-generation models.
GPT-5.5 is twice as fast as other models both at logging in to systems successfully and at recognizing when access attempts have failed.
The model shows a superior ability to 'persist or pivot,' knowing when to give up on unproductive paths to save time and resources.
The performance jump in GPT-5.5 is described as a 'Mythos-like' step change rather than a standard incremental update.
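
To put the miss rates quoted above in perspective, the gap can also be read as a relative reduction in missed vulnerabilities. The snippet below is only an illustrative calculation from the three published percentages; the helper name relative_reduction and the MISS_RATES table are assumptions made for this sketch, not part of XBOW's methodology.

```python
def relative_reduction(old_miss: float, new_miss: float) -> float:
    """Fraction of the baseline's misses that the newer model no longer misses."""
    return (old_miss - new_miss) / old_miss


# Miss rates as quoted in the key points (share of benchmark vulnerabilities missed).
MISS_RATES = {"GPT-5": 0.40, "Opus 4.6": 0.18, "GPT-5.5": 0.10}

for baseline in ("GPT-5", "Opus 4.6"):
    cut = relative_reduction(MISS_RATES[baseline], MISS_RATES["GPT-5.5"])
    print(f"GPT-5.5 vs {baseline}: {cut:.0%} fewer misses")
```

On these figures, GPT-5.5 misses 75% fewer vulnerabilities than GPT-5 and roughly 44% fewer than Opus 4.6, assuming all three rates were measured on the same benchmark.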

Sentiment

The community is predominantly skeptical. Most commenters view the article as marketing rather than rigorous research, question whether GPT-5.5 truly surpasses what smaller models can already do, and point out that OpenAI's own safety restrictions undermine the accessibility claims. A small minority acknowledges genuine technical progress, but with heavy caveats about the need for independent verification.

In Agreement

The drop in vulnerability miss rate from previous model generations is genuinely eye-opening and represents meaningful technical progress.
GPT-5.5 benchmarks similarly to Mythos on Anthropic's published cyber benchmarks, meeting the bar for comparison.
The ability to not just detect but build and run exploits for validation is a key differentiator from simpler vulnerability scanners.

Opposed

Smaller open-weight models have already been shown to find the same headline vulnerabilities as Mythos, making the claimed step change exaggerated.
Both Mythos and GPT-5.5 require highly specific prompts and multiple targeted runs, so the practical gap with smaller models is narrower than presented.
The article reads like a thinly disguised OpenAI advertisement with self-aggrandizing language and misleading data visualizations.
Safety filters actively block legitimate security research uses, with users reporting account warnings and ethical refusals that contradict the 'open to all' promise.
Vendor-adjacent benchmarks should be taken with a grain of salt until the broader security community independently verifies results.
Mythos itself has provided no public proof of its claimed capabilities, making comparisons to it dubious.