Prompted to Perform: A 22% Lift for GPT-5-mini on Tau² Telecom
Article: Positive
Community: Neutral/Mixed

Using the Tau² telecom benchmark, the authors measured GPT-5-mini’s baseline pass rate at 55%. By having Claude rewrite the domain policy documents into clear, step-by-step prompts with decision trees and explicit tool instructions, they raised pass^1 from 0.55 to 0.675 (a 22.7% relative gain) and pass^2 from 0.4 to 0.5 (+25%). The approach also halved the number of unsolved tasks and lifted GPT-5-mini above o3 while retaining its speed and cost advantages.
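For readers unfamiliar with the pass^k notation: it estimates the probability that all k independent trials of a task succeed, which is why pass^2 is strictly harder than pass^1. A minimal sketch of the standard unbiased estimator, assuming each task was run n times (the toy numbers below are illustrative, not the article's data):

```python
from math import comb

def pass_hat_k(per_task_successes: list[int], n: int, k: int) -> float:
    """Unbiased estimator of pass^k: for a task with c successes out of
    n trials, the chance that k randomly drawn trials all succeeded is
    C(c, k) / C(n, k); average that quantity over all tasks."""
    return sum(comb(c, k) / comb(n, k) for c in per_task_successes) / len(per_task_successes)

# Toy example: 4 tasks, each run n=4 times, with 4, 3, 2, 0 successes.
scores = [4, 3, 2, 0]
print(round(pass_hat_k(scores, n=4, k=1), 4))  # 0.5625 (plain mean success rate)
print(round(pass_hat_k(scores, n=4, k=2), 4))  # 0.4167 (all of 2 sampled trials must pass)
```

Note how the flaky task (2/4 successes) drags pass^2 down much more than pass^1, which is why prompt changes that reduce inconsistency show up most clearly in pass^2.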
Key Points
- Baseline GPT-5-mini achieved a 55% success rate on Tau²’s telecom_small tasks across 40 simulations.
- Policy documents were rewritten by Claude into clear, stepwise checklists with decision trees, explicit tool usage, binary conditions, and verification steps.
- The prompt rewrite improved pass^1 from 0.55 to 0.675 (+22.73%) and pass^2 from 0.4 to 0.5 (+25%).
- Previously unsolved tasks were halved (from 6 to 3), showing the agent could now handle cases it had consistently failed before.
- Optimized GPT-5-mini outperformed o3 (~58%) and approached GPT-5 performance while being faster and about five times cheaper.
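The "decision trees, explicit tool usage, binary conditions, and verification steps" in the rewritten policies can be pictured as a small tree the agent walks: each internal node is a yes/no question, and each leaf names exactly one tool call plus a verification step. A hypothetical sketch (tree contents and tool names are illustrative, not taken from the Tau² telecom domain):

```python
# Hypothetical support-flow decision tree: binary questions at internal
# nodes, an explicit tool call plus a verification step at each leaf.
# Node structure and tool names are invented for illustration.
TREE = {
    "question": "Is the line suspended?",
    "yes": {"tool": "resume_line", "verify": "check_line_status"},
    "no": {
        "question": "Is mobile data toggled off?",
        "yes": {"tool": "enable_data", "verify": "check_data_status"},
        "no": {"tool": "escalate_to_human", "verify": None},
    },
}

def next_action(node: dict, answers: dict[str, bool]) -> dict:
    """Walk binary questions until a leaf, then return the explicit
    tool call and its verification step."""
    while "question" in node:
        node = node["yes"] if answers[node["question"]] else node["no"]
    return node

action = next_action(TREE, {"Is the line suspended?": False,
                            "Is mobile data toggled off?": True})
print(action)  # {'tool': 'enable_data', 'verify': 'check_data_status'}
```

The point of this shape is that the model never has to weigh ambiguous policy prose: every branch is a checkable yes/no condition, and every action is paired with a verification tool, which is plausibly why the smaller model's reliability improved.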
Sentiment
The community is engaged but divided. There is genuine interest in the practical result and appreciation for the OpenAI insider perspective, but notable skepticism about the novelty and generalizability of the approach. Multiple commenters see it as retreading ground already covered by DSPy and prompt engineering research.
In Agreement
- Structured, step-by-step prompts reduce cognitive load for smaller models and can meaningfully improve their reliability on complex agentic tasks
- Using a stronger model to rewrite prompts for a cheaper model is a one-time optimization cost that pays ongoing dividends in latency and cost savings
- Clear writing and structured instructions are becoming crucial skills as programming enters a natural language phase
- The OpenAI insider perspective confirms the Telecom benchmark is well-designed and the results reflect genuine capability improvements
Opposed
- DSPy and academic research have already formalized systematic prompt optimization, making this approach well-trodden rather than novel
- Prompt optimizations are fragile and model-specific—they will likely break with each new model release
- Heavily structuring prompts may undermine what the benchmark is actually testing, essentially teaching to the test rather than measuring true agentic capabilities
- The approach may not generalize beyond the specific telecom domain to other use cases like medical or social contexts
- OpenAI should have already caught and optimized for such prompt engineering gaps, casting doubt on the benchmark's real-world significance