Prompted to Perform: A 22% Lift for GPT-5-mini on Tau² Telecom

Added Sep 17, 2025

Using the Tau² telecom benchmark, the authors measured GPT-5-mini’s baseline pass^1 at 0.55 (a 55% success rate). By having Claude rewrite the domain policies into clear, step-by-step prompts with decision trees and explicit tool instructions, pass^1 rose to 0.675 (+22.73%) and pass^2 from 0.4 to 0.5 (+25%). The rewrite also halved the number of unsolved tasks and lifted GPT-5-mini above o3 while retaining its speed and cost advantages.

Key Points

  • Baseline GPT-5-mini achieved a 55% success rate on Tau²’s telecom_small tasks across 40 simulations.
  • Policy documents were rewritten by Claude into clear, stepwise checklists with decision trees, explicit tool usage, binary conditions, and verification steps.
  • The prompt rewrite improved pass^1 from 0.55 to 0.675 (+22.73%) and pass^2 from 0.4 to 0.5 (+25%).
  • Previously unsolved tasks were halved (from 6 to 3), showing the agent could now handle cases it consistently failed before.
  • Optimized GPT-5-mini outperformed o3 (~58%) and approached GPT-5 performance while being faster and about five times cheaper.
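The pass^k figures above follow the τ-bench convention: the probability that an agent solves a task in all k of k independent trials, averaged over tasks, estimated per task as C(c, k)/C(n, k) for c successes out of n trials. A minimal sketch (the per-task trial counts below are hypothetical, not the article’s data):

```python
from math import comb

def pass_hat_k(trial_results, k):
    """Estimate pass^k: the chance the agent solves a task in ALL k
    independent attempts, averaged over tasks (tau-bench convention).

    trial_results: list of per-task (successes, total_trials) tuples.
    Per task, the unbiased estimator is C(c, k) / C(n, k).
    """
    total = 0.0
    for c, n in trial_results:
        total += comb(c, k) / comb(n, k)
    return total / len(trial_results)

# Hypothetical run: 4 tasks, 4 trials each.
results = [(4, 4), (3, 4), (2, 4), (0, 4)]
p1 = pass_hat_k(results, 1)  # ordinary mean success rate
p2 = pass_hat_k(results, 2)  # stricter: must pass both of 2 trials
print(p1, p2)
```

Note that pass^2 is always at most pass^1, which is why the reported pass^2 values (0.4 → 0.5) sit below the pass^1 values (0.55 → 0.675): consistency across repeated trials is a harder bar than a single success.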

Sentiment

Mixed to cautiously positive: interest in the approach and its results, but notable skepticism due to missing before/after prompts and concerns about baseline quality and reproducibility.

In Agreement

  • Structured, checklist-style prompts with decision trees, explicit tool calls, and binary conditions can materially improve smaller models’ reliability.
  • Using a stronger LLM to rewrite system or domain prompts is a practical, high-leverage tactic (“free alpha”).
  • Reducing cognitive load and ambiguity through prerequisites, error handling, and verification steps is beneficial for agent performance.
  • Further gains might be achievable by iteratively refining prompts against the hardest failure cases.

Opposed

  • The absence of before/after prompts undermines credibility and reproducibility; results are hard to trust without direct evidence.
  • Many of the cited techniques are standard prompt-engineering practices, so the improvement may reflect a weak baseline rather than a novel approach.
  • Without a clearer methodology and transparency about the original prompt and its constraints, it is doubtful the gains generalize beyond this benchmark.