The Bitter Lesson Was About Data, Not Compute
September 3, 2025

We’ve misread the Bitter Lesson: scaling’s success was fundamentally driven by data, not compute alone. Chinchilla-style scaling laws imply C ~ D^2, so spending compute efficiently requires substantially more high-quality data, yet the supply of usable Internet text is nearly exhausted. The path forward is a portfolio that combines architectural advances, which squeeze more from a fixed data pool, with alchemical methods that generate new, high-signal data.
Key Points
- Scaling laws yoke compute to data: with compute-optimal N ~ D and C ~ N·D, training compute scales as C ~ D^2, so doubling compute calls for roughly 40% (a factor of √2) more high-quality data to stay compute-optimal; see the arithmetic sketched after this list.
- High-quality Internet text/code is scarce (~10T usable tokens after filtering), so the practical bottleneck is data, not GPUs; there is no ‘second Internet.’
- The Bitter Lesson holds longitudinally (across decades, as compute and data both grew) but not cross-sectionally; at a fixed data budget, progress must come from better architectures and/or new data generation.
- Two intertwined paths: the Architect (structural/architectural advances like Mamba, HRM, ParScale) delivers steady gains, while the Alchemist (self-play, RLHF/DPO, agentic traces) is high-variance but can deliver step-changes.
- Leaders should build a risk-balanced research portfolio (e.g., 70/30 splits) and demand a data roadmap; those who solve data scarcity will outcompete compute-only strategies.
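To make the C ~ D^2 claim concrete, here is a minimal sketch of the arithmetic, assuming the common C ≈ 6·N·D FLOP estimate and a Chinchilla-style optimum of roughly 20 training tokens per parameter; these constants come from Hoffmann et al. (2022), not from this article, and are illustrative only.

```python
# Minimal sketch of Chinchilla-style compute/data arithmetic.
# Assumptions (illustrative, not from the article): C ≈ 6·N·D FLOPs and a
# compute-optimal ratio of ~20 tokens per parameter, i.e. D ≈ 20·N.
import math

TOKENS_PER_PARAM = 20.0      # rough Chinchilla compute-optimal ratio
FLOPS_PER_PARAM_TOKEN = 6.0  # C ≈ 6·N·D

def optimal_tokens(compute_flops: float) -> float:
    """Compute-optimal token count D for a FLOP budget C.

    From C = 6·N·D and N = D/20:  C = 0.3·D^2,  so  D = sqrt(C / 0.3).
    """
    return math.sqrt(compute_flops * TOKENS_PER_PARAM / FLOPS_PER_PARAM_TOKEN)

def optimal_params(compute_flops: float) -> float:
    return optimal_tokens(compute_flops) / TOKENS_PER_PARAM

if __name__ == "__main__":
    c = 5.9e23  # roughly the Chinchilla-70B training budget, in FLOPs
    d1, d2 = optimal_tokens(c), optimal_tokens(2 * c)
    print(f"budget C : {d1 / 1e12:.2f}T tokens, {optimal_params(c) / 1e9:.0f}B params")
    print(f"budget 2C: {d2 / 1e12:.2f}T tokens")
    print(f"extra data needed: {100 * (d2 / d1 - 1):.0f}%  (sqrt(2) - 1 ≈ 41%)")
```

Doubling the FLOP budget raises the compute-optimal token count by a factor of √2, which is the ~40% figure above; the argument is that those extra tokens, not the extra FLOPs, are the scarce input.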
Sentiment
Nuanced and mixed, with moderate agreement that data—not just compute—is the current binding constraint for frontier LMs, alongside strong pushback that synthetic/multimodal/user data and new methods can evade or delay the wall.
In Agreement
- Compute without proportional, high‑quality data wastes spend; Chinchilla‑style scaling implies C ~ D^2 and we’re near the usable Internet text/code limit.
- Progress now requires a dual strategy: Architect (better inductive biases, data‑efficient architectures, test‑time scaling) plus Alchemist (self‑play, RLHF/DPO, agentic loops) to generate high‑signal, verifiable data.
- Verifiable rewards enable effectively unbounded synthetic data in domains like math and code (see the rejection-sampling loop sketched after this list); AlphaZero and DeepSeek R1‑Zero show this can drive step‑function gains.
- Embodiment/robotics face far more degrees of freedom and far sparser data; verifiability, simulation, and engineering constraints argue for focusing on verifiable LM domains first.
- Leaders should attach data roadmaps to compute requests and construct portfolios (incumbents favor architecture; challengers can bet on alchemy) to avoid standing still.
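As a concrete illustration of the Alchemist loop, here is a toy rejection-sampling harness against a verifiable reward: sample several candidate answers per problem, keep only those a programmatic checker accepts, and recycle the verified traces as training data. The model is faked with a noisy guesser so the script runs standalone; every name and constant is illustrative and not taken from the article.

```python
# Toy rejection sampling against a verifiable reward ("Alchemist" loop):
# generate candidates, keep only verifier-approved ones, reuse as training data.
# The policy is simulated; in practice you would sample from an LM and verify
# with unit tests (code) or an exact-answer checker (math).
import random

def sample_problem(rng: random.Random) -> tuple[str, int]:
    a, b = rng.randint(2, 99), rng.randint(2, 99)
    return f"{a} * {b} = ?", a * b

def propose_answers(truth: int, k: int, rng: random.Random) -> list[int]:
    """Stand-in for k samples from a policy; correct about 20% of the time."""
    return [truth if rng.random() < 0.2 else truth + rng.choice([-3, -2, -1, 1, 2, 3])
            for _ in range(k)]

def verifier(answer: int, truth: int) -> bool:
    """Verifiable reward: exact-match check (think unit tests for code)."""
    return answer == truth

def harvest_synthetic_data(n_problems: int = 1000, k: int = 8, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_problems):
        prompt, truth = sample_problem(rng)
        for ans in propose_answers(truth, k, rng):
            if verifier(ans, truth):
                dataset.append({"prompt": prompt, "answer": ans})
                break  # keep one verified trace per problem
    return dataset

if __name__ == "__main__":
    data = harvest_synthetic_data()
    print(f"verified examples harvested: {len(data)} / 1000 problems")
```

The point of the toy: the verifier, not a human annotator, is what turns cheap samples into trustworthy training data, which is why the tactic currently works best in math and code where such checkers exist.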
Opposed
- There isn’t a hard data wall: multimodal (video/audio/3D/sensors), undigitized historical corpora, and user interaction telemetry can still supply massive data; scaling can continue even if not Chinchilla‑optimal.
- Synthetic/self‑play data in verified domains already sidesteps human data limits; the bottleneck claim is overstated given verifiers and distillation.
- Sutton’s Bitter Lesson is about compute‑driven general methods, not data; emphasizing human data is itself a human‑centric bias that a true general method should transcend.
- Synthetic data may not add new information and risks model collapse without grounding; verifiability applies to narrow domains, limiting generality.
- Robotics’ lag is driven more by hardware reliability and economics than AI; embodiment difficulty doesn’t prove a universal data bottleneck for cognition.
- Better objectives, curriculum learning, and high‑SNR data (e.g., Phi, chain‑of‑thought) can cut data needs; inference/test‑time scaling can unlock latent capability without more pretraining tokens (a toy self‑consistency sketch follows this list).
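To illustrate that last point, here is a toy self-consistency (majority-vote) sketch of test-time scaling: spending more inference compute per question raises accuracy with zero additional pretraining tokens. The model is simulated as a sampler that is right with probability p and otherwise picks a distractor; all names and constants are illustrative, not from the article.

```python
# Toy test-time scaling via self-consistency (majority voting over N samples).
# A simulated model answers correctly with probability p_correct; voting over
# more samples concentrates on the right answer without any extra training.
import random
from collections import Counter

def simulate_model_sample(truth: int, p_correct: float, rng: random.Random) -> int:
    """One stochastic sample from a hypothetical model's answer distribution."""
    if rng.random() < p_correct:
        return truth
    return rng.choice([truth - 1, truth + 1, truth + 2])  # spread-out errors

def majority_vote_accuracy(n_samples: int, p_correct: float,
                           n_questions: int = 20_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_questions):
        truth = rng.randint(0, 999)
        votes = Counter(simulate_model_sample(truth, p_correct, rng)
                        for _ in range(n_samples))
        if votes.most_common(1)[0][0] == truth:
            correct += 1
    return correct / n_questions

if __name__ == "__main__":
    for n in (1, 5, 25):
        print(f"samples/question: {n:>2}  ->  accuracy ≈ {majority_vote_accuracy(n, 0.4):.2f}")
```

Whether such test-time gains substitute for new data or merely surface what pretraining already encoded is exactly the disagreement summarized above.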