GPT-5 Thinking Makes ChatGPT a Surprisingly Competent Research Assistant

Simon Willison shows that GPT-5 Thinking in ChatGPT is now a highly capable search companion, combining reasoning with web tools to find, verify, and synthesize information, often better than manual searching. Through examples ranging from Heathrow travelators to Exeter Quay vaults, it not only cites credible sources and reads PDFs but also suggests concrete next steps, such as drafting emails. While it is slower and still needs human oversight, it delivers impressive depth on mobile and sets a benchmark for tool-augmented LLM research.
Key Points
- GPT-5 Thinking interleaves reasoning steps with web search and tool calls, producing slower but markedly more comprehensive and well-cited results.
- In diverse real-world tasks (from identifying buildings and product availability to historical research and legal names), it surfaces authoritative sources, reads PDFs, and proposes concrete next steps.
- It often outperforms equivalent manual searches due to rapid iteration and source evaluation, and it works exceptionally well on mobile with voice input.
- Despite its competence, it still needs human oversight—inspecting its thought traces and guiding scope (e.g., asking for “vibes” vs. deep dives) improves outcomes.
- For practitioners, it exemplifies the power of tool calling plus chain-of-thought (and multi-step RAG), serving as a gold standard for AI-assisted search workflows.
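The "tool calling plus chain-of-thought" pattern the key points credit for these results can be sketched as a simple loop: the model either emits a search action or a final answer, and each tool result is fed back into the context for the next reasoning step. Everything below (`run_search_agent`, `fake_llm`, `fake_search`, and the `SEARCH:`/`ANSWER:` protocol) is a hypothetical stand-in for illustration, not OpenAI's actual API or implementation.

```python
def fake_search(query: str) -> str:
    """Stand-in for a web-search tool call; returns canned snippets."""
    corpus = {
        "exeter quay vaults": "The Quay vaults in Exeter date to the 1820s.",
    }
    return corpus.get(query.lower(), "no results")

def fake_llm(context: str) -> str:
    """Stand-in for the model: decide whether to search or to answer."""
    if "SEARCH" not in context:
        # No tool results yet: issue a search action.
        return "SEARCH: exeter quay vaults"
    # A result is already in context, so commit to an answer.
    return "ANSWER: The Quay vaults in Exeter date to the 1820s."

def run_search_agent(question: str, max_steps: int = 5) -> str:
    """Interleave reasoning (LLM calls) with tool use (search calls)."""
    context = question
    for _ in range(max_steps):
        step = fake_llm(context)
        if step.startswith("SEARCH:"):
            query = step.removeprefix("SEARCH:").strip()
            result = fake_search(query)
            # Feed the tool result back so the next step can reason over it.
            context += f"\nSEARCH {query} -> {result}"
        else:
            return step.removeprefix("ANSWER:").strip()
    return "gave up"

print(run_search_agent("When were the Exeter Quay vaults built?"))
```

The loop is what distinguishes this from single-shot "search then summarize": the model sees each result before deciding whether to search again, which is why it can iterate quickly and discard weak sources.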
Sentiment
The discussion is cautiously positive but heavily qualified. The community largely accepts that GPT-5 Thinking represents a meaningful improvement in AI-assisted search, particularly for complex multi-step queries, but pushes back strongly on any suggestion that it replaces careful manual research. The dominant sentiment is "useful tool with important limitations" rather than a revolutionary breakthrough.
In Agreement
- GPT-5 Thinking's interleaved search-and-reasoning approach is genuinely better than previous AI search, which uncritically summarized whatever the top results said
- The fire-and-forget parallelization benefit is real — users can do other things while the model searches and synthesizes, unlike manual Google searches that demand active attention
- LLM search excels at tasks requiring digestion of many sources quickly, such as solving obscure tip-of-my-tongue queries or finding product datasheets buried under marketing material
- GPT-5 is notably more critical of sources than competitors like Gemini or Grok, comparing and evaluating results rather than just summarizing whatever appears first
- The convenience of not having to manually sift through SEO-optimized content farms is a genuine quality-of-life improvement, especially on mobile
Opposed
- For many common queries, traditional Google search is faster and produces equivalent or better results — a commenter systematically demonstrated this across most of the article's examples
- LLMs still struggle with source credibility, potentially presenting forum speculation or marketing material as authoritative, especially on niche topics
- The Research Goblin framing overhypes what is essentially competent search automation — the mundane examples in the article do not demonstrate anything revolutionary beyond what prior models could do
- ChatGPT fails to surface primary sources, especially non-English ones, over-relying on English secondary literature like Wikipedia — a critical limitation for any tool marketed as research-grade
- Sycophancy remains a fundamental problem: the model tends to find evidence supporting whatever position the user implies, making it unreliable for genuinely balanced research
- Users lose the opportunity to exercise their own judgment when answers are pre-digested, and may uncritically accept incorrect information presented with false confidence