Tokenization

How text is split into tokens for language models, including tokenizer design, vocabulary construction, byte-pair encoding, and the downstream effects of tokenization choices on model behavior and output.

Reading List

Damage Control

The Tokenizer Tax: Why AI Sticker Prices Are a Lie

Jul 13, 2026153

The 'price per million tokens' metric is a misleading illusion because different model tokenizers can quietly inflate your bill by over 70% for the exact same code.

Tokenization AI Business Models Token Optimization Anthropic OpenAI

Under the Hood

The Hidden 30% Tax in Claude 4.7

Apr 17, 2026707

Claude 4.7 uses significantly more tokens for the same text, increasing session costs by ~30% in exchange for better instruction following.

Tokenization Anthropic AI Coding Agents Token Optimization Technology Economics

Under the Hood

When ‘Seahorse + Emoji’ Hits an Empty Token: Why LLMs Invent the Seahorse Emoji

Oct 6, 2025734

Models compose “seahorse + emoji,” but with no matching token the unembedding snaps to a nearby emoji, causing confident errors and occasional feedback loops.

AI Hallucinations AI Interpretability Transformer Models Tokenization