Prompted to Perform: A 22% Lift for GPT-5-mini on Tau² Telecom
September 17, 2025

On the Tau² telecom benchmark, GPT-5-mini's baseline pass rate was 55%. After Claude rewrote the domain policies into clear, step-by-step prompts with decision trees and explicit tool instructions, pass^1 rose from 0.55 to 0.675 (+22.73% relative) and pass^2 from 0.4 to 0.5 (+25% relative). The rewrite also halved the number of previously unsolved tasks and lifted GPT-5-mini above o3 while preserving its speed and cost advantages.
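The core move is simple enough to sketch. A minimal version using the Anthropic Python SDK might look like the following; the model ID, the instruction wording, and the `rewrite_policy` helper are illustrative assumptions, not the authors' published setup:

```python
# Sketch of the policy-rewriting step, under the assumptions stated above.
import anthropic

REWRITE_INSTRUCTION = """\
Rewrite the telecom support policy below as an agent prompt with:
- a numbered, step-by-step checklist,
- decision trees whose conditions are strictly binary (yes/no),
- the exact tool to call at each step, with its required arguments,
- prerequisites, error handling, and a final verification step.

POLICY:
{policy}
"""

def rewrite_policy(policy_text: str) -> str:
    """Ask a stronger model (here Claude) to restructure a policy document."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model; any strong rewriter works
        max_tokens=4096,
        messages=[{"role": "user",
                   "content": REWRITE_INSTRUCTION.format(policy=policy_text)}],
    )
    return response.content[0].text  # the checklist-style prompt for the agent
```

The rewritten text then replaces the original policy document in the agent's context; nothing about the agent loop itself needs to change.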
Key Points
- Baseline GPT-5-mini achieved a 55% success rate on Tau²’s telecom_small tasks across 40 simulations.
- Policy documents were rewritten by Claude into clear, stepwise checklists with decision trees, explicit tool usage, binary conditions, and verification steps.
- The prompt rewrite improved pass^1 from 0.55 to 0.675 (+22.73% relative) and pass^2 from 0.4 to 0.5 (+25% relative); see the pass^k sketch after this list.
- Previously unsolved tasks were halved (from 6 to 3), showing the agent could now handle cases it had consistently failed before.
- Optimized GPT-5-mini outperformed o3 (~58%) and approached GPT-5 performance while being faster and about five times cheaper.
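For readers unfamiliar with the metric: in τ-bench-style evaluations, pass^k estimates the probability that a task succeeds on all k of k independent trials, averaged over tasks, so it rewards consistency rather than best-of-k luck. A small sketch of the standard unbiased estimator (the trial counts below are invented for illustration):

```python
from math import comb

def pass_hat_k(successes: list[int], n: int, k: int) -> float:
    """For a task with c successes in n trials, C(c, k) / C(n, k) is the
    chance that k trials drawn without replacement all succeed; pass^k
    averages this over tasks."""
    return sum(comb(c, k) / comb(n, k) for c in successes) / len(successes)

# Four tasks, four trials each: two always solved, one flaky, one never solved.
counts = [4, 4, 2, 0]
print(pass_hat_k(counts, n=4, k=1))  # 0.625 -- the plain success rate
print(pass_hat_k(counts, n=4, k=2))  # ~0.542 -- the flaky task is penalized
```

Because pass^2 only credits tasks the agent solves repeatedly, it sits below pass^1 both before (0.4 vs 0.55) and after (0.5 vs 0.675) the rewrite.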
Sentiment
Mixed to cautiously positive: interest in the approach and its results, but notable skepticism due to missing before/after prompts and concerns about baseline quality and reproducibility.
In Agreement
- Structured, checklist-style prompts with decision trees, explicit tool calls, and binary conditions can materially improve smaller models' reliability (illustrated after this list).
- Using a stronger LLM to rewrite system or domain prompts is a practical, high-leverage tactic (“free alpha”).
- Reducing cognitive load and ambiguity through prerequisites, error handling, and verification steps is beneficial for agent performance.
- Further gains might be achievable by iteratively refining prompts against the hardest failure cases.
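To make the first point concrete, here is a hypothetical fragment in the endorsed style; the scenario and tool names (check_account_status, reset_sim) are invented, since the article does not publish the actual prompts:

```python
# Hypothetical checklist fragment in the style the commenters endorse.
# Scenario and tool names are invented for illustration.
POLICY_FRAGMENT = """\
Task: customer reports mobile data is not working.
1. Call check_account_status(customer_id).
   Is the account suspended? (yes/no)
     yes -> inform the customer; STOP (reactivation is handled by billing).
     no  -> go to step 2.
2. Call reset_sim(customer_id).
   Did the tool return status == "ok"? (yes/no)
     yes -> go to step 3.
     no  -> escalate to a human agent; STOP.
3. Verify: ask the customer to confirm data works before closing the ticket.
"""
```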
Opposed
- The absence of before/after prompts undermines credibility and reproducibility; results are hard to trust without direct evidence.
- Many of the cited techniques are standard prompt-engineering practices, so the improvement may reflect a weak baseline rather than a novel approach.
- Gains may not generalize without a clearer methodology and more transparency about the original prompt and its constraints.