Inside a Tiny GPT: A Visual Walkthrough of Autoregressive Prediction
Article: Positive | Community: Very Positive | Consensus
A tiny GPT model is used to visually explain how LLMs process and generate text. Tokens (A/B/C) are embedded into 48-dimensional vectors, passed through transformer layers, and mapped to next-token probabilities via softmax. The predicted token is fed back to continue generation, illustrating the full autoregressive loop.
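The pipeline above (token → index → 48-dim embedding → transformer → linear projection → softmax) can be sketched in a few lines. This is an illustrative toy, not the article's actual model: the weights are random stand-ins, and the transformer stack is elided to a single placeholder step.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["A", "B", "C"]      # token indices: A=0, B=1, C=2
d_model = 48                 # embedding width used in the article

embedding = rng.normal(size=(len(vocab), d_model))  # token embedding table
lm_head = rng.normal(size=(d_model, len(vocab)))    # final linear projection

def softmax(logits):
    e = np.exp(logits - logits.max())  # subtract max for numerical stability
    return e / e.sum()

x = embedding[vocab.index("B")]   # embed token B as a 48-dim vector
hidden = x                        # transformer layers would transform x here
probs = softmax(hidden @ lm_head) # probability distribution over A, B, C
print(probs)                      # three positive values summing to 1
```

With real trained weights, `probs` would concentrate on whichever token correctly continues the sequence; here the distribution is arbitrary, but the shapes and the flow match the description above.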
Key Points
- A minimal GPT (“nano-gpt,” ~85K parameters) is used to visually demonstrate how LLMs operate on a simple sorting task over tokens A, B, and C.
- Tokens are mapped to indices (A=0, B=1, C=2) and then to 48-dimensional embeddings before entering the transformer stack.
- The model processes embeddings through standard transformer components: layer normalization, causal multi-head self-attention, and an MLP.
- A linear projection followed by softmax yields a probability distribution over the next token at each step.
- Predictions are fed back into the model to generate sequences autoregressively, illustrating the full inference loop.
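The autoregressive loop in the last point can be sketched as follows. The `next_token_probs` function here is a hypothetical stand-in for the transformer forward pass, hard-coded to suit the article's sorting task over A, B, and C; the loop structure (predict, append, re-feed) is the part being illustrated.

```python
vocab = ["A", "B", "C"]

def next_token_probs(context):
    # Stand-in for the model: deterministically favors the
    # alphabetically-largest token seen so far (toy "sorting" behavior).
    target = max(context)
    return [1.0 if t == target else 0.0 for t in vocab]

def generate(prompt, steps):
    seq = list(prompt)
    for _ in range(steps):
        probs = next_token_probs(seq)
        nxt = vocab[probs.index(max(probs))]  # greedy decode: take the argmax
        seq.append(nxt)                       # feed the prediction back in
    return "".join(seq)

print(generate(["C", "A", "B"], 3))  # -> "CABCCC"
```

In a real GPT the loop is identical: only `next_token_probs` is replaced by the full embedding → transformer → projection → softmax pass shown in the key points above.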
Sentiment
The Hacker News community overwhelmingly agrees that this is an outstanding educational resource. There is near-universal praise for the quality of the visualization, with debate limited to philosophical questions about whether understanding the math means understanding the intelligence. Minor skepticism about AI's broader value exists but is directed at the industry, not the visualization itself.
In Agreement
- The visualization makes transformer architecture concrete and intuitive, serving as an excellent teaching tool for audiences ranging from children to professionals
- The attention mechanism is elegantly simple and can be written on a napkin — the visualization faithfully represents the essential parts
- Interactive visualizations like this empower scientists to break open the "black box" of LLMs and advance interpretability
- The high vote-to-comment ratio signals universal admiration for a high-quality, non-contentious technical resource
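The "napkin" formula referenced in the second point is the standard scaled dot-product attention; the symbols $Q$, $K$, $V$, and $d_k$ (query, key, and value matrices and the key dimension) follow the usual transformer convention and are not spelled out in the summary itself:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

In the causal variant the model uses, positions attending to future tokens are masked out before the softmax.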
Opposed
- Even with complete visualization, we can't understand the model's actual decision-making — the interpretability gap remains
- Seeing the internals reinforces that these models are statistical matrices doing next-token prediction, making AGI from this architecture seem implausible
- Massive investment in AI is disproportionate to actual utility, and corporate pressure to adopt AI tools is misguided
- The visualization doesn't go far enough — it needs actual model weights, customizable inputs, and training/backpropagation visualization to be truly useful