Anthropic Unveils Claude Opus 4.6: SOTA Agentic Coding, 1M-Token Context, and Stronger Safety

Anthropic released Claude Opus 4.6, delivering stronger coding, planning, and long-context performance, including a 1M-token context window (beta). It leads key benchmarks across agentic coding, deep reasoning, and web search, and shows markedly better long-context retrieval and coherence than prior models. New safety probes and extensive audits preserve a strong safety profile, while API and product features (adaptive thinking, effort controls, context compaction, agent teams, Excel/PowerPoint) help users apply the gains.
Key Points
- Opus 4.6 advances agentic coding, long-horizon planning, and large-codebase work, with a 1M-token context window in beta and up to 128k output tokens.
- It achieves state-of-the-art results on major evaluations (e.g., Terminal-Bench 2.0, Humanity’s Last Exam, BrowseComp, GDPval-AA), significantly outperforming Opus 4.5 and GPT-5.2 on key metrics.
- Long-context retrieval and reasoning improve dramatically (e.g., 76% on MRCR v2 at 1M tokens vs. 18.5% for Sonnet 4.5), reducing context rot and improving real-world performance.
- Safety is strengthened with comprehensive audits, interpretability-informed checks, low misalignment and over-refusal rates, and new cybersecurity probes; defensive cyber use is prioritized.
- New API features (adaptive thinking, effort controls, context compaction, US-only inference) and product updates (Claude Code agent teams, improved Excel, PowerPoint preview) enable more robust workflows; base pricing stays the same with premium pricing for very long prompts.
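The API knobs listed above could be exercised with a request like the following minimal sketch. It only assembles an illustrative request payload; the model id, the `effort` field, and the beta flag name are assumptions for illustration, not confirmed API values.

```python
# Illustrative sketch of a long-context request payload for a Messages-style
# API. Field names marked "assumed" are hypothetical, not confirmed values.

def build_request(prompt: str, effort: str = "high") -> dict:
    """Assemble a hypothetical request enabling long context, a large
    output budget, and an effort control."""
    return {
        "model": "claude-opus-4-6",   # assumed model id
        "max_tokens": 128_000,        # up to 128k output tokens (per the article)
        "effort": effort,             # hypothetical effort-control field
        "betas": ["context-1m"],      # hypothetical 1M-token context beta flag
        "messages": [{"role": "user", "content": prompt}],
    }

request = build_request("Review this large codebase dump for dead code.")
```

In practice, such a payload would be sent through the provider's official SDK, which handles authentication and streaming; the point here is only which settings the new features would touch.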
Sentiment
The overall sentiment is cautiously positive. Hacker News largely agrees that Claude is the leading coding assistant and that Opus 4.6 represents genuine improvement, but the community is deeply skeptical of benchmarks, concerned about pricing sustainability, and divided on whether the improvements represent transformative progress or incremental gains. The coding community specifically appreciates Claude's directness and capability, while noting significant gaps in non-coding use cases.
In Agreement
- Claude excels at coding tasks and provides honest, non-sycophantic feedback that is more useful than ChatGPT's overly positive responses
- The 1M-token context window is a meaningful advancement for working with large codebases
- Inference costs are falling significantly across the industry, making agent workflows more practical
- Claude Code features like memory and agent teams represent valuable innovation in developer tooling
- Claude is becoming competitive for non-technical general-purpose use, with some non-technical users switching from ChatGPT
- Anthropic's focus on safety and alignment is differentiated and valuable for enterprise and regulated industries
Opposed
- Long-context tests using Harry Potter books are meaningless because the content is already in training data
- AI benchmarks are inherently unreliable and companies have strong incentives to game them
- Claude is significantly weaker than ChatGPT and Gemini for non-coding tasks such as research, recipes, and travel planning
- Rate limits on the Pro plan make Opus practically unusable for regular work
- AI companies are still burning through investor money and may never achieve profitability when training costs are included
- LLMs are fundamentally just pattern matching from training data, not demonstrating genuine intelligence