GPT-5 Thinking Makes ChatGPT a Surprisingly Competent Research Assistant

Simon Willison shows that GPT-5 Thinking in ChatGPT is now a highly capable search companion, combining reasoning with web tools to find, verify, and synthesize information, often better than manual searching. Through examples ranging from Heathrow travelators to Exeter Quay vaults, it not only cites credible sources and reads PDFs but also suggests concrete next steps, such as drafting emails. While it is slower and still needs human oversight, it delivers impressive depth on mobile and sets a benchmark for tool-augmented LLM research.
Key Points
- GPT-5 Thinking interleaves reasoning steps with web search and tool calls, producing slower but markedly more comprehensive and well-cited results.
- In diverse real-world tasks (from identifying buildings and product availability to historical research and legal names), it surfaces authoritative sources, reads PDFs, and proposes concrete next steps.
- It often outperforms equivalent manual searches due to rapid iteration and source evaluation, and it works exceptionally well on mobile with voice input.
- Despite its competence, it still needs human oversight—inspecting its thought traces and guiding scope (e.g., asking for “vibes” vs. deep dives) improves outcomes.
- For practitioners, it exemplifies the power of tool calling plus chain-of-thought (and multi-step RAG), serving as a gold standard for AI-assisted search workflows.
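The "tool calling plus chain-of-thought" pattern the key points credit for these results can be sketched as a simple loop: the model either emits a search action or a final answer, and each tool result is fed back into the context for the next reasoning step. Everything below (`run_search_agent`, `fake_llm`, `fake_search`, and the `SEARCH:`/`ANSWER:` protocol) is a hypothetical stand-in for illustration, not OpenAI's actual API or implementation.

```python
def fake_search(query: str) -> str:
    """Stand-in for a web-search tool call; returns canned snippets."""
    corpus = {
        "exeter quay vaults": "The Quay vaults in Exeter date to the 1820s.",
    }
    return corpus.get(query.lower(), "no results")

def fake_llm(context: str) -> str:
    """Stand-in for the model: decide whether to search or to answer."""
    if "SEARCH" not in context:
        # No tool results yet: issue a search action.
        return "SEARCH: exeter quay vaults"
    # A result is already in context, so commit to an answer.
    return "ANSWER: The Quay vaults in Exeter date to the 1820s."

def run_search_agent(question: str, max_steps: int = 5) -> str:
    """Interleave reasoning (LLM calls) with tool use (search calls)."""
    context = question
    for _ in range(max_steps):
        step = fake_llm(context)
        if step.startswith("SEARCH:"):
            query = step.removeprefix("SEARCH:").strip()
            result = fake_search(query)
            # Feed the tool result back so the next step can reason over it.
            context += f"\nSEARCH {query} -> {result}"
        else:
            return step.removeprefix("ANSWER:").strip()
    return "gave up"

print(run_search_agent("When were the Exeter Quay vaults built?"))
```

The loop is what distinguishes this from single-shot "search then summarize": the model sees each result before deciding whether to search again, which is why it can iterate quickly and discard weak sources.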
Sentiment
The discussion is cautiously positive but heavily qualified. The community largely accepts that GPT-5 Thinking represents a meaningful improvement in AI-assisted search, particularly for complex multi-step queries, but pushes back strongly on any suggestion that it replaces careful manual research. The dominant sentiment is "useful tool with important limitations" rather than a revolutionary breakthrough.
In Agreement
- GPT-5 Thinking's interleaved search-and-reasoning approach is genuinely better than previous AI search, which uncritically summarized whatever the top results said
- The fire-and-forget parallelization benefit is real — users can do other things while the model searches and synthesizes, unlike manual Google searches that demand active attention
- LLM search excels at tasks requiring digestion of many sources quickly, such as solving obscure tip-of-my-tongue queries or finding product datasheets buried under marketing material
- GPT-5 is notably more critical of sources than competitors like Gemini or Grok, comparing and evaluating results rather than just summarizing whatever appears first
- The convenience of not having to manually sift through SEO-optimized content farms is a genuine quality-of-life improvement, especially on mobile
Opposed
- For many common queries, traditional Google search is faster and produces equivalent or better results — a commenter systematically demonstrated this across most of the article's examples
- LLMs still struggle with source credibility, potentially presenting forum speculation or marketing material as authoritative, especially on niche topics
- The Research Goblin framing overhypes what is essentially competent search automation — the mundane examples in the article do not demonstrate anything revolutionary beyond what prior models could do
- ChatGPT fails to surface primary sources, especially non-English ones, over-relying on English secondary literature like Wikipedia — a critical limitation for any tool marketed as research-grade
- Sycophancy remains a fundamental problem: the model tends to find evidence supporting whatever position the user implies, making it unreliable for genuinely balanced research
- Users lose the opportunity to exercise their own judgment when answers are pre-digested, and may uncritically accept incorrect information presented with false confidence