OpenAI Launches GPT‑5.2: SOTA Model for Professional Work and Agentic Workflows

Added Dec 11, 2025
Article: Very Positive · Community: Neutral/Mixed

OpenAI launched GPT‑5.2 (Instant, Thinking, Pro), a major upgrade in professional capability with state-of-the-art results across knowledge work, coding, long-context reasoning, tool use, and vision. It improves factuality, handles contexts of up to 256k tokens more accurately, and executes complex workflows with better tool reliability and lower-latency options. Rolling out to paid ChatGPT plans and the API, GPT‑5.2 introduces new pricing, model names, and safety enhancements.

Key Points

  • GPT‑5.2 (Instant, Thinking, Pro) delivers state-of-the-art performance in professional knowledge work, coding, long-context reasoning, tool use, and vision.
  • On GDPval, GPT‑5.2 Thinking beats or ties industry professionals 70.9% of the time, with significant gains in spreadsheet/presentation generation and coding (SWE-Bench Pro 55.6%).
  • Long-context reasoning leads on OpenAI MRCRv2 (near-100% at 256k for 4-needle), with a new /compact endpoint extending effective context for long-running, tool-heavy workflows.
  • Tool calling is far more reliable (Tau2-bench Telecom 98.7%), with improved low-latency performance and stronger end-to-end agentic workflows.
  • Available now in ChatGPT paid plans and the API with new pricing (gpt‑5.2: $1.75/M input, $14/M output), safety upgrades for sensitive content, and maintained support for GPT‑5.1.
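As a quick sanity check on the new API pricing, the per-request cost arithmetic can be sketched as below. The rates come from the article ($1.75 per million input tokens, $14 per million output tokens); the token counts in the example are illustrative assumptions, not measured values.

```python
# Per-request cost sketch at the article's stated gpt-5.2 API rates.
# Token counts in the example are hypothetical, chosen to illustrate
# a long-context request near the 256k window.

INPUT_PRICE_PER_M = 1.75    # USD per 1M input tokens (from the article)
OUTPUT_PRICE_PER_M = 14.00  # USD per 1M output tokens (from the article)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single request at the stated rates."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Example: a 256k-token prompt with a 4k-token completion.
print(f"${request_cost(256_000, 4_000):.3f}")  # → $0.504
```

At these rates, output tokens cost eight times as much as input tokens, so for long-context workloads the prompt still dominates the bill.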

Sentiment

HN's response is engaged but predominantly skeptical. The community acknowledges the benchmark gains and celebrates practical improvements in coding and agentic tasks, but the dominant discussion frame—anchored by the top comment—shifts toward what GPT-5.2 still doesn't solve: hallucination, grounding, and the gap between marketed and actual capabilities. Disagreement is substantive and technical rather than hostile.

In Agreement

  • Power users report GPT-5.2 delivers genuine improvements for hard coding tasks—particularly Rust/CUDA, complex bug hunting, and long agentic sessions—over its predecessors.
  • The xhigh reasoning effort mode is seen as a meaningful step for professional and deeply technical work that previous models struggled with.
  • Some users confirm early-impression gains in coherence and reliability for multi-step agentic workflows.
  • Simon Willison's pelican SVG test suggests measurable improvement in visual and creative code generation quality.

Opposed

  • The most upvoted thread argues that better grounding and hallucination reduction matter far more than benchmark gains; raw intelligence improvements are hitting diminishing returns for most everyday use cases.
  • Benchmark scores are viewed skeptically by many users who argue models can be trained directly on benchmarks, rendering public leaderboards unreliable.
  • The advertised 400K context window is called out as misleading: actual ChatGPT limits are 16K–196K by tier, and consistent long-context reliability for non-RAG tasks is questioned.
  • LLM hallucination is characterized as architectural—a consequence of optimizing for plausible token sequences rather than truth—making it unlikely to be solved with incremental improvements.
  • Users report frustrating inconsistency: the same prompt can give different results on different days, breaking automated workflows.
  • Skeptics question whether excited testimonials in the thread are organic or influenced by marketing, reflecting broader distrust of AI hype cycles.