Google Upgrades Gemini 3 Deep Think for Real-World Science and Engineering

Google has unveiled a major upgrade to Gemini 3 Deep Think that strengthens its ability to solve complex scientific and engineering problems, rolling it out to Google AI Ultra subscribers and offering early API access. Early users report real-world wins, from catching flaws in advanced math papers to optimizing crystal growth and accelerating hardware design. The model sets new marks on top reasoning benchmarks and extends its strength across physics, chemistry, and practical engineering, including automated generation of 3D-printable models from sketches.
Key Points
- Gemini 3 Deep Think received a major upgrade focused on rigorous reasoning and practical engineering utility, developed with input from scientists and researchers.
- It is available now in the Gemini app for Google AI Ultra subscribers, with early API access offered to select researchers, engineers, and enterprises (a minimal call sketch follows this list).
- Early applications include catching a subtle math proof flaw, optimizing thin-film crystal growth beyond 100 μm, and speeding physical component design.
- The model sets or matches top results on demanding benchmarks: 48.4% on Humanity’s Last Exam (no tools), 84.6% on ARC-AGI-2, a 3455 Elo rating on Codeforces, and gold-medal-level performance at the International Mathematical Olympiad (IMO).
- Deep Think extends beyond math and coding into the sciences, achieving gold-medal-level results on the 2025 Physics and Chemistry Olympiads and 50.5% on CMT-Benchmark; it also supports end-to-end engineering tasks such as converting sketches to 3D-printable models.
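
For those exploring the early API access noted above, here is a minimal sketch of what a Deep Think call might look like through Google's `google-genai` Python SDK. The model identifier `gemini-3-deep-think` is a placeholder, not a confirmed ID, and the prompt is purely illustrative; check Google's API documentation for the actual model name and access requirements.

```python
# pip install google-genai
# Minimal sketch, assuming early API access has been granted.
import os

from google import genai

# Reads the API key from the GEMINI_API_KEY environment variable.
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

response = client.models.generate_content(
    model="gemini-3-deep-think",  # placeholder ID; use the name from Google's docs
    contents=(
        "Review the following lemma for subtle gaps in the proof, "
        "and flag any step that does not follow from the stated hypotheses."
    ),
)
print(response.text)
```

Given the high per-task cost raised in the discussion below, reserving calls like this for genuinely hard problems, rather than routine agent loops, is likely the practical usage pattern.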
Sentiment
The discussion is divided. There is genuine excitement about the ARC-AGI-2 score as a milestone in reasoning capability, but significant skepticism about whether it translates to practical improvement. Many power users report Gemini underperforming competitors like Claude and GPT in day-to-day tasks despite benchmark dominance, creating a credibility gap. The community respects the technical achievement while questioning its real-world significance.
In Agreement
- Gemini 3 has unique strengths in academic, scientific, and mathematical reasoning that have been stable for months, making it the preferred model for non-coding intellectual tasks
- The ARC-AGI-2 score, certified by the ARC Prize Foundation, represents genuine progress in fluid intelligence and spatial reasoning
- Gemini's ability to play Balatro from text descriptions alone demonstrates impressive generalization beyond what training data would directly enable
- Inference costs are dropping rapidly, making expensive reasoning modes like Deep Think increasingly viable over time
- The model excels at classical engineering tasks, physical system modeling, and scientific reasoning in ways competitors do not
Opposed
- High benchmark scores do not match many users' practical experience: Gemini frequently ignores instructions, hallucinates, and underperforms Claude and GPT in day-to-day tasks
- ARC-AGI results may be susceptible to gaming, since frontier models can only be evaluated on providers' own infrastructure, making data leakage hard to rule out
- Deep Think's high cost per task makes it impractical for real agent workflows, and the price-performance ratio is poor compared to competitors
- Benchmark improvements do not necessarily translate to real-world capability gains, and specific score numbers may be inflated by benchmarkmaxxing
- The AGI framing in benchmark names is misleading: solving visual puzzles does not demonstrate general intelligence, and ARC-AGI should be called a spatial reasoning benchmark
- Gemini's Deep Research feature produces unreliable citations, contradicts itself, and invents terms, undermining claims of practical research utility