OTelBench: LLMs Still Can’t Reliably Instrument Distributed Tracing

Added Jan 29
Article: Negative · Community: Neutral · Divisive

OTelBench tests 14 LLMs on real OpenTelemetry instrumentation tasks across 11 languages and finds poor reliability, with the top model achieving just a 29% pass rate. Models frequently produce compiling code that yields malformed traces, often by merging distinct user actions due to incorrect context handling. While a few models offer decent cost-speed trade-offs, dependable AI-driven SRE remains out of reach, and engineers should expect to implement tracing manually for now.

Key Points

  • Across 14 frontier models, OpenTelemetry instrumentation tasks had low success rates; the best model (Claude Opus 4.5) passed only 29%.
  • OTelBench evaluates realistic, polyglot microservices (11 languages, 23 tasks) and checks not just compilation but trace correctness and context propagation.
  • A prevalent failure mode is conflating separate user actions into one trace due to improper context handling, producing malformed spans and relationships.
  • Language outcomes vary widely: C++ led at 37% (albeit on a simpler task), Go reached 20%, JS/Python/PHP/.NET landed mid-tier, and Swift, Ruby, and Java saw zero successes.
  • Cost-speed-performance trade-offs favor a small Pareto set (notably Gemini 3 Flash for value), but overall reliability is insufficient for production SRE needs.
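The context-handling failure mode above can be made concrete with a toy tracer. The sketch below uses only Python's stdlib `contextvars` (it is not the real OpenTelemetry API; all names are illustrative): a span started with no active parent begins a new trace, while a span started under a stale ambient context inherits that trace's ID. Handling two unrelated user actions while one action's span is still "current" is exactly how distinct actions get merged into a single malformed trace.

```python
# Illustrative toy tracer -- NOT the OpenTelemetry API. It only models
# the one behavior at issue: spans inherit the trace of whatever span
# is "current" in the ambient context.
import contextvars
import itertools

_ids = itertools.count(1)
_current_span = contextvars.ContextVar("current_span", default=None)

class Span:
    def __init__(self, name):
        parent = _current_span.get()
        self.name = name
        # A span with no active parent starts a brand-new trace.
        self.trace_id = parent.trace_id if parent else next(_ids)
        self._token = None

    def __enter__(self):
        self._token = _current_span.set(self)
        return self

    def __exit__(self, *exc):
        _current_span.reset(self._token)

def handle_action(name):
    with Span(name) as span:
        return span.trace_id

# Correct: each user action runs in a fresh context, so each gets
# its own trace_id.
t1 = contextvars.Context().run(handle_action, "checkout")
t2 = contextvars.Context().run(handle_action, "login")
assert t1 != t2  # two actions, two traces

# Buggy: handling the second action while the first action's span is
# still current stitches both actions under one trace -- the kind of
# malformed output the benchmark flags.
with Span("checkout") as first:
    merged = handle_action("login")
assert merged == first.trace_id  # distinct actions share one trace
```

The real fix in OpenTelemetry terms is to start each user action as a new root (or from that action's own propagated context) rather than letting spans attach to whatever context happens to be active.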

Sentiment

Mixed, but leaning toward constructive skepticism of the benchmark methodology. The community accepts that LLMs struggle with OTel instrumentation but pushes back significantly on the framing, arguing the benchmark's sparse prompts and one-shot approach do not reflect how AI tools are effectively used in real-world SRE. Commenters spend more energy critiquing the benchmark design than celebrating the finding.

In Agreement

  • LLMs fundamentally struggle with tasks requiring precise specification conformance across multiple requirements, unlike flexible code generation where many outputs are acceptable
  • There is a significant training data gap for operational and SRE tasks compared to software development, explaining poor model performance in this domain
  • Models exhibit systematic failures rather than random ones — most benchmark tasks are never solved regardless of how many times they are retried
  • Even basic operational tasks like identifying and killing the correct process are unreliable with current frontier models
  • Enterprise observability is poorly standardized even among humans, making it an especially difficult domain for AI to learn from limited examples

Opposed

  • The benchmark prompts are too sparse and lack documentation, SOPs, and testing loops that would be provided to any competent human — better prompting and context dramatically improve results
  • Real-world AI SRE tools like HolmesGPT show very different results when models are given proper tooling and structured context, including success at Fortune 500 companies
  • OTel instrumentation is genuinely difficult even for experienced humans and should not be characterized as a 'simple SRE task'
  • One developer successfully implemented OTel across a 200k LOC project using Claude Code by providing typed wrappers and documentation, showing the approach matters more than raw model capability
  • The benchmark measures unassisted one-shot performance rather than the iterative, context-rich workflow where AI actually provides value in SRE