Standardize LLM Observability on OpenTelemetry

added Sep 28, 2025

Chatwoot’s production issues with an AI agent highlight the need for deep LLM observability. While OpenInference provides AI-centric span types, its partial adherence to OpenTelemetry and limited language support create practical integration problems. The author argues teams should standardize on OpenTelemetry, enrich spans with AI attributes, and help advance OTel’s GenAI semantics—an approach SigNoz is actively supporting.

Key Points

  • Production LLMs need step-level visibility (RAG documents, tool calls, inputs/outputs, decisions) to debug real issues.
  • OpenTelemetry is the most mature, widely supported standard, but its span kinds are generic for AI workflows.
  • OpenInference offers AI-native span types and a better LLM-focused UX, yet it lacks full OTel semantic compatibility and broad language SDKs (e.g., Ruby).
  • Mixing telemetry standards fragments observability and breaks out-of-the-box, OTel-based features across the stack.
  • Best practice: use OpenTelemetry as the backbone, enrich spans with AI attributes, and contribute to OTel’s GenAI semantic conventions.

Sentiment

The overall sentiment is mixed but largely pragmatic. While there's a strong consensus on the *need* for granular LLM observability, there's considerable debate regarding the *how*, particularly concerning OpenTelemetry's suitability as the sole standard. Many commenters agree with the article's premise about OTel's foundational role and the existing compatibility issues with AI-specific tools. However, a significant portion expresses skepticism about OTel's current richness for complex LLM systems, suggests alternative approaches, or prefers specialized AI observability platforms, highlighting that the community is still actively exploring and debating the best path forward.

In Agreement

  • There is a critical and universally acknowledged need for granular observability in LLM agent systems, including tracing tool calls, decision paths, costs, and context management, to diagnose unpredictable behavior.
  • The specific technical issue of 'unknown' span kinds when using OpenInference/Phoenix with OpenTelemetry is a recognized problem, confirming the standards compatibility challenges highlighted in the article.
  • OpenTelemetry is a strong contender for the foundational standard, and adherence to its emerging GenAI semantic conventions is the recommended future direction.
  • The 'standards rift' between general observability frameworks (OTel) and AI-specific tooling (like OpenInference) is a real and significant challenge in the LLM observability space.

Opposed

  • OpenTelemetry, despite its generality, may lack the inherent semantic richness required for complex multi-agent LLM systems, and simply 'adding attributes' might be insufficient, suggesting a need for hybrid or more specialized standards.
  • Alternative, simpler approaches, such as using existing conversation logs, can serve as a more natural and sufficient means of tracing agent execution flow without requiring additional monitoring infrastructure.
  • Specialized AI observability tools like Phoenix (despite OTel compatibility issues) offer superior user experience for experimental AI work, or other tools like Langfuse and W&B are preferred, indicating a fragmented and competitive tooling market.
  • The fundamental problem of observing the internal workings and interpretability of LLMs themselves is a much harder, possibly unsolvable problem, distinct from tracing their external interactions.