Prompted to Perform: A 22% Lift for GPT-5-mini on Tau² Telecom
Article: Positive
Community: Neutral/Mixed

Using the Tau² telecom benchmark, the authors measured GPT-5-mini’s baseline pass rate at 55%. By having Claude rewrite the domain policy documents into clear, step-by-step prompts with decision trees and explicit tool instructions, they raised pass^1 from 0.55 to 0.675 (a 22.7% relative gain) and pass^2 from 0.4 to 0.5 (+25%). The approach also halved the number of unsolved tasks and lifted GPT-5-mini above o3 while retaining its speed and cost advantages.
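For readers unfamiliar with the pass^k notation: it estimates the probability that all k independent trials of a task succeed, which is why pass^2 is strictly harder than pass^1. A minimal sketch of the standard unbiased estimator, assuming each task was run n times (the toy numbers below are illustrative, not the article's data):

```python
from math import comb

def pass_hat_k(per_task_successes: list[int], n: int, k: int) -> float:
    """Unbiased estimator of pass^k: for a task with c successes out of
    n trials, the chance that k randomly drawn trials all succeeded is
    C(c, k) / C(n, k); average that quantity over all tasks."""
    return sum(comb(c, k) / comb(n, k) for c in per_task_successes) / len(per_task_successes)

# Toy example: 4 tasks, each run n=4 times, with 4, 3, 2, 0 successes.
scores = [4, 3, 2, 0]
print(round(pass_hat_k(scores, n=4, k=1), 4))  # 0.5625 (plain mean success rate)
print(round(pass_hat_k(scores, n=4, k=2), 4))  # 0.4167 (all of 2 sampled trials must pass)
```

Note how the flaky task (2/4 successes) drags pass^2 down much more than pass^1, which is why prompt changes that reduce inconsistency show up most clearly in pass^2.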
Key Points
- Baseline GPT-5-mini achieved a 55% success rate on Tau²’s telecom_small tasks across 40 simulations.
- Policy documents were rewritten by Claude into clear, stepwise checklists with decision trees, explicit tool usage, binary conditions, and verification steps.
- The prompt rewrite improved pass^1 from 0.55 to 0.675 (+22.73%) and pass^2 from 0.4 to 0.5 (+25%).
- Previously unsolved tasks were halved (from 6 to 3), showing the agent could now handle cases it had consistently failed before.
- Optimized GPT-5-mini outperformed o3 (~58%) and approached GPT-5 performance while being faster and about five times cheaper.
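The "decision trees, explicit tool usage, binary conditions, and verification steps" in the rewritten policies can be pictured as a small tree the agent walks: each internal node is a yes/no question, and each leaf names exactly one tool call plus a verification step. A hypothetical sketch (tree contents and tool names are illustrative, not taken from the Tau² telecom domain):

```python
# Hypothetical support-flow decision tree: binary questions at internal
# nodes, an explicit tool call plus a verification step at each leaf.
# Node structure and tool names are invented for illustration.
TREE = {
    "question": "Is the line suspended?",
    "yes": {"tool": "resume_line", "verify": "check_line_status"},
    "no": {
        "question": "Is mobile data toggled off?",
        "yes": {"tool": "enable_data", "verify": "check_data_status"},
        "no": {"tool": "escalate_to_human", "verify": None},
    },
}

def next_action(node: dict, answers: dict[str, bool]) -> dict:
    """Walk binary questions until a leaf, then return the explicit
    tool call and its verification step."""
    while "question" in node:
        node = node["yes"] if answers[node["question"]] else node["no"]
    return node

action = next_action(TREE, {"Is the line suspended?": False,
                            "Is mobile data toggled off?": True})
print(action)  # {'tool': 'enable_data', 'verify': 'check_data_status'}
```

The point of this shape is that the model never has to weigh ambiguous policy prose: every branch is a checkable yes/no condition, and every action is paired with a verification tool, which is plausibly why the smaller model's reliability improved.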
Sentiment
The community is engaged but divided. There is genuine interest in the practical result and appreciation for the OpenAI insider perspective, but notable skepticism about the novelty and generalizability of the approach. Multiple commenters see it as retreading ground already covered by DSPy and prompt engineering research.
In Agreement
- Structured, step-by-step prompts reduce cognitive load for smaller models and can meaningfully improve their reliability on complex agentic tasks
- Using a stronger model to rewrite prompts for a cheaper model is a one-time optimization cost that pays ongoing dividends in latency and cost savings
- Clear writing and structured instructions are becoming crucial skills as programming enters a natural language phase
- The OpenAI insider perspective confirms the Telecom benchmark is well-designed and the results reflect genuine capability improvements
Opposed
- DSPy and academic research have already formalized systematic prompt optimization, making this approach well-trodden rather than novel
- Prompt optimizations are fragile and model-specific—they will likely break with each new model release
- Heavily structuring prompts may undermine what the benchmark is actually testing, essentially teaching to the test rather than measuring true agentic capabilities
- The approach may not generalize beyond the specific telecom domain to other use cases like medical or social contexts
- OpenAI should have already caught and optimized for such prompt engineering gaps, casting doubt on the benchmark's real-world significance