Anthropic Unveils Claude Opus 4.6: SOTA Agentic Coding, 1M-Token Context, and Stronger Safety

Anthropic released Claude Opus 4.6, delivering stronger coding, planning, and long-context performance, including a 1M-token context window (beta). It leads key benchmarks across agentic coding, deep reasoning, and web search, and shows markedly better long-context retrieval and coherence than prior models. New safety probes and extensive audits preserve a strong safety profile, while API and product features (adaptive thinking, effort controls, context compaction, agent teams, Excel/PowerPoint) help users apply the gains.
Key Points
- Opus 4.6 advances agentic coding, long-horizon planning, and large-codebase work, with a 1M-token context window in beta and up to 128k output tokens.
- It achieves state-of-the-art results on major evaluations (e.g., Terminal-Bench 2.0, Humanity’s Last Exam, BrowseComp, GDPval-AA), significantly outperforming Opus 4.5 and GPT-5.2 on key metrics.
- Long-context retrieval and reasoning improve dramatically (e.g., 76% on MRCR v2 at 1M tokens vs. 18.5% for Sonnet 4.5), reducing context rot and improving real-world performance.
- Safety is strengthened with comprehensive audits, interpretability-informed checks, low misalignment and over-refusal rates, and new cybersecurity probes; defensive cyber use is prioritized.
- New API features (adaptive thinking, effort controls, context compaction, US-only inference) and product updates (Claude Code agent teams, improved Excel, PowerPoint preview) enable more robust workflows; base pricing stays the same with premium pricing for very long prompts.
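The API knobs listed above could be exercised with a request like the following minimal sketch. It only assembles an illustrative request payload; the model id, the `effort` field, and the beta flag name are assumptions for illustration, not confirmed API values.

```python
# Illustrative sketch of a long-context request payload for a Messages-style
# API. Field names marked "assumed" are hypothetical, not confirmed values.

def build_request(prompt: str, effort: str = "high") -> dict:
    """Assemble a hypothetical request enabling long context, a large
    output budget, and an effort control."""
    return {
        "model": "claude-opus-4-6",   # assumed model id
        "max_tokens": 128_000,        # up to 128k output tokens (per the article)
        "effort": effort,             # hypothetical effort-control field
        "betas": ["context-1m"],      # hypothetical 1M-token context beta flag
        "messages": [{"role": "user", "content": prompt}],
    }

request = build_request("Review this large codebase dump for dead code.")
```

In practice, such a payload would be sent through the provider's official SDK, which handles authentication and streaming; the point here is only which settings the new features would touch.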
Sentiment
The overall sentiment is cautiously positive. Hacker News largely agrees that Claude is the leading coding assistant and that Opus 4.6 represents genuine improvement, but the community is deeply skeptical of benchmarks, concerned about pricing sustainability, and divided on whether the improvements represent transformative progress or incremental gains. The coding community specifically appreciates Claude's directness and capability, while noting significant gaps in non-coding use cases.
In Agreement
- Claude excels at coding tasks and provides honest, non-sycophantic feedback that is more useful than ChatGPT's overly positive responses
- The 1M-token context window is a meaningful advancement for working with large codebases
- Inference costs are falling significantly across the industry, making agent workflows more practical
- Claude Code features like memory and agent teams represent valuable innovation in developer tooling
- Claude is becoming competitive for non-technical general-purpose use, with some non-technical users switching from ChatGPT
- Anthropic's focus on safety and alignment is differentiated and valuable for enterprise and regulated industries
Opposed
- Long-context tests using Harry Potter books are meaningless because the content is already in training data
- AI benchmarks are inherently unreliable and companies have strong incentives to game them
- Claude is significantly weaker than ChatGPT and Gemini for non-coding tasks such as research, recipes, and travel planning
- Rate limits on the Pro plan make Opus practically unusable for regular work
- AI companies are still burning through investor money and may never achieve profitability when training costs are included
- LLMs are fundamentally just pattern matching from training data, not demonstrating genuine intelligence