The LLM Architecture Gallery: Mapping the Evolution of Open-Weight Models

This technical gallery provides a detailed comparison of prominent open-weight LLM architectures released between 2024 and 2026. It tracks the evolution of model design from standard dense decoders to sparse Mixture-of-Experts (MoE) designs and hybrid systems such as Mamba-transformer blends. By documenting specific parameters and attention mechanisms, the resource highlights the industry's intense focus on inference efficiency and long-context performance.
Key Points
- The industry is rapidly transitioning from traditional dense architectures to sparse Mixture-of-Experts (MoE), which decouples total parameter count from per-token inference cost by activating only a small subset of experts per token.
- Advanced attention mechanisms such as Multi-Head Latent Attention (MLA) and sliding-window attention are becoming standard for reducing KV cache overhead in long-context models.
- A new wave of hybrid architectures is emerging, blending transformer blocks with alternative layers like Mamba-2, DeltaNet, or Lightning Attention for improved scaling.
- Normalization and stability techniques, including QK-Normalization and variations in residual layouts (pre-norm vs. post-norm), remain critical differentiators in modern training recipes.
- The gallery emphasizes transparency by providing direct links to tech reports, configuration files, and from-scratch implementation code for open-weight models.
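The MoE trade-off in the first point can be made concrete with simple parameter arithmetic. The sketch below counts FFN-only parameters for a top-k routed MoE; the layer counts and dimensions are illustrative placeholders, not taken from any specific model in the gallery.

```python
# Illustrative MoE parameter counting (hypothetical config, FFN layers only).
def moe_param_counts(n_layers, d_model, d_ff, n_experts, top_k):
    """Rough FFN parameter count for a top-k routed MoE.

    Each expert is modeled as a plain 2-matrix FFN (gated variants and
    attention/embedding parameters are omitted for simplicity).
    """
    expert_params = 2 * d_model * d_ff           # up + down projection
    total = n_layers * n_experts * expert_params  # stored on disk / in memory
    active = n_layers * top_k * expert_params     # multiplied per token
    return total, active

total, active = moe_param_counts(n_layers=32, d_model=4096, d_ff=14336,
                                 n_experts=8, top_k=2)
print(f"total FFN params:  {total / 1e9:.1f}B")
print(f"active per token:  {active / 1e9:.1f}B")
```

With 8 experts and top-2 routing, the active-to-total ratio is 1:4, which is exactly the "large model at small-model inference cost" effect the summary describes.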
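The KV-cache point is also easy to quantify. The following back-of-the-envelope sketch compares full attention with a sliding window; all shapes (layers, KV heads, head dim, window size) are assumed example values, not figures from the gallery.

```python
# Approximate KV-cache sizing with and without a sliding window (assumed shapes).
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   window=None, bytes_per_elem=2):
    """KV-cache size in bytes; a sliding window caps the cached positions.

    bytes_per_elem=2 assumes fp16/bf16 cache entries.
    """
    cached = seq_len if window is None else min(seq_len, window)
    # Factor of 2 covers both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * cached * bytes_per_elem

full = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                      seq_len=128_000)
swa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                     seq_len=128_000, window=4096)
print(f"full attention:     {full / 2**30:.1f} GiB")
print(f"4k sliding window:  {swa / 2**30:.2f} GiB")
```

At 128k context, capping the cache at a 4k window shrinks it by roughly 31x in this toy configuration; MLA attacks the same cost by compressing keys and values into a low-rank latent instead of truncating positions.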
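QK-Normalization, mentioned among the stability techniques, can be sketched in a few lines: queries and keys are RMS-normalized per head before the dot product, which bounds attention logits even when activations grow large during training. This is a minimal NumPy illustration with random data, not any model's actual implementation (the learnable scale is omitted).

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMSNorm over the last (head_dim) axis; learnable gain omitted."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

# QK-Norm: normalize queries and keys *before* the dot product, so logit
# magnitude stays bounded regardless of activation scale.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 64)) * 50.0   # deliberately large activations
k = rng.normal(size=(4, 64)) * 50.0
logits_raw = q @ k.T / np.sqrt(64)
logits_qk = rms_norm(q) @ rms_norm(k).T / np.sqrt(64)
print(np.abs(logits_raw).max())   # grows with activation scale
print(np.abs(logits_qk).max())    # bounded near sqrt(head_dim)
```

After QK-Norm each head vector has unit RMS, so every logit is at most sqrt(head_dim) in magnitude, keeping softmax inputs in a stable range.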
Sentiment
Overwhelmingly positive toward the resource itself, with strong appreciation for the quality and presentation. The intellectual discussion leans toward a gentle counterpoint: while architecture matters, the gallery highlights how convergent modern designs are, and the real story of LLM progress lies in scale and training. This is more of a 'yes, and...' nuance than genuine disagreement.
In Agreement
- The gallery is an excellent, well-executed reference for understanding the internal architecture of modern LLMs, comparable to the Neural Network Zoo for earlier architectures.
- Architectural details like attention mechanisms and normalization techniques matter for practical concerns such as context window behavior, inference efficiency, and prompt pattern effectiveness.
- The resource fills an important gap for practitioners and researchers wanting a modular understanding of how real-world models are engineered.
Opposed
- The gallery inadvertently reveals how little fundamental architectural innovation has occurred since GPT-2 — modern models are still stacked attention and feed-forward layers, with the real gains coming from scaling and training methods such as RLVR (reinforcement learning with verifiable rewards).
- Competitiveness in LLMs comes from scale, data quality, and fine-tuning rather than architecture, making the architectural differences cataloged less important than they might appear.
- The gallery would be more useful with evolutionary context — a family tree or influence layout showing how architectures descended from each other, rather than presenting them as independent entries.