From Sampling to Grammars: Making LLMs Reliably Output Structured Data (Even for Thinking Models)
The author outlines a practical, CPU-efficient sampling pipeline and explains how Ollama enforces structured outputs using grammars compiled from JSON Schemas. They show that letting thinking models complete their internal reasoning before applying format constraints yields higher quality than prefilling the format. Looking forward, structured output will increasingly become an intrinsic model capability, reducing reliance on strict masking.
Key Points
- Ollama’s sampling pipeline applies topK → temperature → softmax → topP → minP, then samples; CPU-side efficiency hinges on early topK pruning and linear passes over sorted logits.
- Structured outputs are enforced via grammars derived from JSON Schemas: sample a token, validate it against the grammar, then accept it or resample with masking. This is slower, but it guarantees correctness, and quality improves as the model "grounds" itself in the format.
- State machines provide an alternative path to the same guarantees and can be efficient; existing tools like llguidance suffice for grammar-constrained sampling.
- Thinking models (e.g., Harmony format) are sensitive to format; prefilling can harm quality—let the model finish thinking, use turn tokens to separate phases, then constrain the final output.
- gpt-oss is highly format-trained and can produce perfect JSON without explicit instruction; when constrained, it may even claim it was instructed to do so, reflecting strong alignment.
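The sampling pipeline in the first bullet can be sketched in a few linear passes over sorted logits. This is a minimal illustration, not Ollama's implementation; the parameter defaults are arbitrary, and a real pipeline works over tensors rather than Python lists.

```python
import math
import random

def sample_token(logits, top_k=40, temperature=0.8, top_p=0.9, min_p=0.05):
    """Illustrative topK -> temperature -> softmax -> topP -> minP -> sample."""
    # topK: keep only the k highest logits; pruning early keeps every later
    # pass linear over a short, already-sorted candidate list
    indexed = sorted(enumerate(logits), key=lambda p: p[1], reverse=True)[:top_k]

    # temperature: scale logits before the softmax
    scaled = [(i, l / temperature) for i, l in indexed]

    # softmax over the surviving logits (shifted by the max for stability)
    m = max(l for _, l in scaled)
    exps = [(i, math.exp(l - m)) for i, l in scaled]
    total = sum(e for _, e in exps)
    probs = [(i, e / total) for i, e in exps]

    # topP (nucleus): keep the smallest prefix whose cumulative mass reaches top_p
    kept, mass = [], 0.0
    for i, p in probs:  # already sorted descending
        kept.append((i, p))
        mass += p
        if mass >= top_p:
            break

    # minP: drop candidates below min_p times the top probability
    threshold = min_p * kept[0][1]
    kept = [(i, p) for i, p in kept if p >= threshold]

    # renormalize what survives and sample one token id
    total = sum(p for _, p in kept)
    ids = [i for i, _ in kept]
    weights = [p / total for _, p in kept]
    return random.choices(ids, weights=weights)[0]
```

Note how aggressive settings collapse the distribution: with `min_p=1.0` only the argmax survives, so sampling becomes greedy decoding.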
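The sample-validate-resample loop from the structured-outputs bullet can be sketched with a toy state machine standing in for a grammar compiled from a JSON Schema. The five-token vocabulary, the transition table, and the optimistic-sample-first strategy are all illustrative assumptions, not Ollama's actual machinery.

```python
import math
import random

# toy vocabulary and automaton accepting exactly the object {"k":"v"}
VOCAB = ['{', '}', '"k"', ':', '"v"']
TRANSITIONS = {
    0: {'{': 1},
    1: {'"k"': 2},
    2: {':': 3},
    3: {'"v"': 4},
    4: {'}': 5},  # state 5 = accept
}
ACCEPT = 5

def softmax_sample(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return random.choices(range(len(logits)), [e / total for e in exps])[0]

def constrained_decode(next_logits, max_steps=10):
    """next_logits(prefix) -> per-token logits; a stand-in for the model."""
    state, out = 0, []
    while state != ACCEPT and len(out) < max_steps:
        logits = next_logits(out)
        tok = softmax_sample(logits)              # optimistic: sample unconstrained
        if VOCAB[tok] not in TRANSITIONS[state]:  # validate against the grammar
            # reject: mask every invalid token to -inf and resample
            masked = [l if VOCAB[i] in TRANSITIONS[state] else float('-inf')
                      for i, l in enumerate(logits)]
            tok = softmax_sample(masked)
        out.append(VOCAB[tok])
        state = TRANSITIONS[state][VOCAB[tok]]
    return ''.join(out)
```

The fast path costs nothing when the model already emits valid tokens; masking only kicks in on rejection, which matches the claim that overhead shrinks as the model grounds itself in the format.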
Sentiment
The community is broadly positive and technically engaged, with genuine interest in the article's explanation of sampling pipelines and structured output mechanisms. While there is healthy debate between different approaches — grammar-based constraints vs. post-processing vs. two-pass methods — the discussion is constructive rather than adversarial. Multiple library authors participate to share their work and answer questions, creating a rich knowledge-sharing dynamic.
In Agreement
- Grammar-based constrained generation is effective and can be made nearly free in latency when mask computation runs in parallel with GPU forward passes
- Models trained specifically for structured output show that JSON generation is increasingly becoming a trained capability rather than just a latent one
- The article provides a valuable and clear explanation of the sampling pipeline and structured output mechanisms
- Allowing thinking models to complete their reasoning before applying format constraints is important for maintaining output quality
Opposed
- Token-level grammar constraints skew the output probability distribution in undesirable ways, effectively changing the model's intended distribution rather than just filtering it
- Constrained generation can degrade reasoning quality; post-processing or two-pass approaches may produce better results for complex tasks
- Schema-aligned parsing via error-tolerant parsing after generation is preferable because it does not interfere with the model's generation process
- Perfectly outputting structured data is a bar that probabilistic sampling fundamentally cannot meet, regardless of training improvements