GPT‑5.1‑Codex‑Max: Long‑Horizon Agentic Coding with Compaction and Fewer Tokens

OpenAI launched GPT-5.1-Codex-Max, a long-horizon agentic coding model that uses compaction to work across multiple context windows and deliver project-scale results. It outperforms prior models on key benchmarks while using fewer tokens, adds a new ‘xhigh’ reasoning mode, Windows support, and improved CLI collaboration. Available now across Codex surfaces (API coming soon), it ships with stronger safety measures and is recommended for agentic coding in Codex-like environments.
Key Points
- New agentic coding model (GPT-5.1-Codex-Max) introduces compaction to work coherently across multiple context windows, enabling project-scale, long-running tasks.
- Strong benchmark gains and practical improvements (Windows support, better CLI collaboration) with significantly improved token efficiency (≈30% fewer thinking tokens at medium effort on SWE-Bench Verified).
- New ‘xhigh’ reasoning effort offers better answers for non-latency-sensitive tasks; ‘medium’ remains the recommended default.
- Safety measures include cybersecurity monitoring, secure sandbox defaults, guidance on prompt-injection risks, and a continued requirement for human code review.
- Available now in Codex for Plus/Pro/Business/Edu/Enterprise, replacing GPT-5.1-Codex as default; API access is coming soon and use is recommended for agentic coding contexts.
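The announcement does not detail how compaction is implemented, but the general technique is well known: once a session's history approaches the context limit, older turns are condensed into a summary so the model can keep working with a bounded window. The sketch below is a minimal, hypothetical illustration of that idea; the names (`compact`, `summarize`, `CONTEXT_LIMIT`) and the message-count budget are illustrative assumptions, not OpenAI's actual implementation, which trains the model natively on compacted context.

```python
# Hypothetical sketch of context compaction. All names and limits here are
# illustrative assumptions; they only demonstrate the general technique of
# summarizing older turns so a session can span multiple context windows.

CONTEXT_LIMIT = 16  # max messages kept (stand-in for a real token budget)
KEEP_RECENT = 4     # most recent messages always preserved verbatim

def summarize(messages):
    """Stand-in for a model call that condenses old turns into one note."""
    return {"role": "system",
            "content": f"[summary of {len(messages)} earlier messages]"}

def compact(history):
    """If history exceeds the limit, replace the oldest turns with a summary."""
    if len(history) <= CONTEXT_LIMIT:
        return history
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    return [summarize(old)] + recent

# Simulate a long-running session: the history stays bounded.
history = []
for i in range(50):
    history.append({"role": "user", "content": f"step {i}"})
    history = compact(history)
```

In this toy version the summary is produced by prompt-level machinery outside the model; the release's claim is that training the model directly on compacted context yields better coherence than this kind of bolt-on summarization.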
Sentiment
The community is cautiously positive about GPT-5.1-Codex-Max's improvements but deeply divided on the fundamental question of what makes a good coding agent. Those who value autonomous, specification-driven work praise Codex highly, while those who want an interactive collaborator with common sense prefer Claude. There is broad agreement that compaction and token efficiency are welcome improvements, but significant skepticism about the long-running autonomous vision and benchmark-driven marketing. The discussion frequently drifts into cross-provider complaints about payment systems and product complexity, suggesting user experience is as important as model capability.
In Agreement
- Codex's extreme instruction adherence makes it excellent for specification-driven tasks that need to be correct, and several users share impressive results, including major codebase refactors completed autonomously on the first attempt
- Native compaction training is a meaningful improvement — training the model on compacted context rather than relying on system-prompt instructions should produce better long-running session coherence
- Token efficiency improvements are welcome, as earlier Codex versions consumed tokens aggressively with minimal improvement in output quality
- Head-to-head comparisons show Codex producing better implementation plans and more comprehensive code than Gemini 3 Pro, which hallucinated database columns and skipped requirements
- The model is well-suited for debugging, log analysis, and other tasks where you can define verification criteria and let it work asynchronously for extended periods
Opposed
- Codex lacks common sense — it would rather rewrite a TLS library than ask if the network is available, making it unsuitable as a true 'coding partner' and better described as an outsourced consultant
- The emphasis on long-running autonomous work misses what most developers actually need: high-quality short iterative tasks with human-in-the-loop collaboration
- Benchmark improvements may not reflect real-world gains, with concerns that GPT models are overfitted to benchmarks like SWE-Bench
- Codex's restrictive sandbox creates practical problems — Go developers cannot access module caches, forcing use of the unsafe full-permission mode
- The strategic timing of the release alongside Gemini 3 suggests this is incremental rather than a major leap, and the naming scheme is becoming absurdly confusing
- Multiple users report that Claude actually does follow CLAUDE.md instructions well, contradicting the top comment's claim, suggesting experiences vary dramatically