From Sampling to Grammars: Making LLMs Reliably Output Structured Data (Even for Thinking Models)
The author outlines a practical, CPU-efficient sampling pipeline and explains how Ollama enforces structured outputs using grammars compiled from JSON Schemas. They show that letting thinking models complete their internal reasoning before applying format constraints yields higher quality than prefilling the format. Looking forward, structured output will increasingly become an intrinsic model capability, reducing reliance on strict masking.
Key Points
- Ollama’s sampling pipeline applies topK → temperature → softmax → topP → minP, then samples; CPU-side efficiency hinges on early topK pruning and linear passes over sorted logits.
- Structured outputs are enforced via grammars derived from JSON Schemas: sample a token, validate it against the grammar, then accept it or resample with masking. This is slower, but it guarantees correctness, and quality improves as the model "grounds" itself in the format.
- State machines provide an alternative path to the same guarantees and can be efficient; existing tools like llguidance suffice for grammar-constrained sampling.
- Thinking models (e.g., Harmony format) are sensitive to format; prefilling can harm quality—let the model finish thinking, use turn tokens to separate phases, then constrain the final output.
- gpt-oss is highly format-trained and can produce perfect JSON without explicit instruction; when constrained, it may even claim it was instructed to do so, reflecting strong alignment.
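The sampling pipeline in the first bullet can be sketched in a few linear passes over sorted logits. This is a minimal illustration, not Ollama's implementation; the parameter defaults are arbitrary, and a real pipeline works over tensors rather than Python lists.

```python
import math
import random

def sample_token(logits, top_k=40, temperature=0.8, top_p=0.9, min_p=0.05):
    """Illustrative topK -> temperature -> softmax -> topP -> minP -> sample."""
    # topK: keep only the k highest logits; pruning early keeps every later
    # pass linear over a short, already-sorted candidate list
    indexed = sorted(enumerate(logits), key=lambda p: p[1], reverse=True)[:top_k]

    # temperature: scale logits before the softmax
    scaled = [(i, l / temperature) for i, l in indexed]

    # softmax over the surviving logits (shifted by the max for stability)
    m = max(l for _, l in scaled)
    exps = [(i, math.exp(l - m)) for i, l in scaled]
    total = sum(e for _, e in exps)
    probs = [(i, e / total) for i, e in exps]

    # topP (nucleus): keep the smallest prefix whose cumulative mass reaches top_p
    kept, mass = [], 0.0
    for i, p in probs:  # already sorted descending
        kept.append((i, p))
        mass += p
        if mass >= top_p:
            break

    # minP: drop candidates below min_p times the top probability
    threshold = min_p * kept[0][1]
    kept = [(i, p) for i, p in kept if p >= threshold]

    # renormalize what survives and sample one token id
    total = sum(p for _, p in kept)
    ids = [i for i, _ in kept]
    weights = [p / total for _, p in kept]
    return random.choices(ids, weights=weights)[0]
```

Note how aggressive settings collapse the distribution: with `min_p=1.0` only the argmax survives, so sampling becomes greedy decoding.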
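The sample-validate-resample loop from the structured-outputs bullet can be sketched with a toy state machine standing in for a grammar compiled from a JSON Schema. The five-token vocabulary, the transition table, and the optimistic-sample-first strategy are all illustrative assumptions, not Ollama's actual machinery.

```python
import math
import random

# toy vocabulary and automaton accepting exactly the object {"k":"v"}
VOCAB = ['{', '}', '"k"', ':', '"v"']
TRANSITIONS = {
    0: {'{': 1},
    1: {'"k"': 2},
    2: {':': 3},
    3: {'"v"': 4},
    4: {'}': 5},  # state 5 = accept
}
ACCEPT = 5

def softmax_sample(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return random.choices(range(len(logits)), [e / total for e in exps])[0]

def constrained_decode(next_logits, max_steps=10):
    """next_logits(prefix) -> per-token logits; a stand-in for the model."""
    state, out = 0, []
    while state != ACCEPT and len(out) < max_steps:
        logits = next_logits(out)
        tok = softmax_sample(logits)              # optimistic: sample unconstrained
        if VOCAB[tok] not in TRANSITIONS[state]:  # validate against the grammar
            # reject: mask every invalid token to -inf and resample
            masked = [l if VOCAB[i] in TRANSITIONS[state] else float('-inf')
                      for i, l in enumerate(logits)]
            tok = softmax_sample(masked)
        out.append(VOCAB[tok])
        state = TRANSITIONS[state][VOCAB[tok]]
    return ''.join(out)
```

The fast path costs nothing when the model already emits valid tokens; masking only kicks in on rejection, which matches the claim that overhead shrinks as the model grounds itself in the format.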
Sentiment
The community is broadly positive and technically engaged, with genuine interest in the article's explanation of sampling pipelines and structured output mechanisms. While there is healthy debate between different approaches — grammar-based constraints vs. post-processing vs. two-pass methods — the discussion is constructive rather than adversarial. Multiple library authors participate to share their work and answer questions, creating a rich knowledge-sharing dynamic.
In Agreement
- Grammar-based constrained generation is effective and can be made nearly free in latency when mask computation runs in parallel with GPU forward passes
- Models trained specifically for structured output show that JSON generation is increasingly becoming a trained capability rather than just a latent one
- The article provides a valuable and clear explanation of the sampling pipeline and structured output mechanisms
- Allowing thinking models to complete their reasoning before applying format constraints is important for maintaining output quality
Opposed
- Token-level grammar constraints skew the output probability distribution in undesirable ways, effectively changing the model's intended distribution rather than just filtering it
- Constrained generation can degrade reasoning quality; post-processing or two-pass approaches may produce better results for complex tasks
- Schema-aligned parsing via error-tolerant parsing after generation is preferable because it does not interfere with the model's generation process
- Perfectly outputting structured data is a bar that probabilistic sampling fundamentally cannot meet, regardless of training improvements