The Bitter Lesson Was About Data, Not Compute

We’ve misread the Bitter Lesson: the success of scaling was driven fundamentally by data, not compute alone. Chinchilla-style scaling laws imply C ~ D^2, so spending more compute efficiently requires commensurately more high-quality data, yet the usable Internet is nearly exhausted. The path forward is a portfolio that pairs architectural advances, which squeeze more from fixed data, with alchemical methods that generate new, high-signal data.
Key Points
- Scaling laws yoke compute to data: with parameters N ~ D (training tokens) at the Chinchilla optimum and compute C ~ N·D, compute-optimal training gives C ~ D^2; doubling compute therefore requires √2 ≈ 1.41× the data, i.e. ~41% more high-quality tokens, to stay efficient.
- High-quality Internet text/code is scarce (~10T usable tokens after filtering), so the practical bottleneck is data, not GPUs; there is no ‘second Internet.’
- The Bitter Lesson holds longitudinally (across decades of growing data and compute) but not cross-sectionally: at a fixed data budget, progress must come from better architectures and/or new data generation.
- Two intertwined paths: the Architect (structural/architectural advances like Mamba, HRM, ParScale) delivers steady gains, while the Alchemist (self-play, RLHF/DPO, agentic traces) is high-variance but can deliver step-changes.
- Leaders should build a risk-balanced research portfolio (e.g., 70/30 splits) and demand a data roadmap; those who solve data scarcity will outcompete compute-only strategies.
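The compute–data coupling in the key points can be sketched numerically. This is an illustrative back-of-the-envelope, assuming the compute-optimal relation C ~ D^2 from the Chinchilla-style argument above; the function name and sample multipliers are mine, not the article's:

```python
import math

def data_needed(compute_multiplier: float) -> float:
    """Factor by which the training-token budget D must grow when
    compute C grows by `compute_multiplier`, assuming C ~ D^2
    (i.e. D ~ sqrt(C) at the compute-optimal frontier)."""
    return math.sqrt(compute_multiplier)

# Doubling compute requires sqrt(2) ~= 1.41x the data: ~41% more tokens.
print(f"{data_needed(2.0):.3f}")   # -> 1.414

# A 10x compute jump needs ~3.16x more high-quality data, which is the
# crux of the 'no second Internet' problem given only ~10T usable tokens.
print(f"{data_needed(10.0):.3f}")  # -> 3.162
```

The square-root relationship is why the bottleneck bites: compute budgets can grow orders of magnitude faster than the supply of filtered, high-signal text.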
Sentiment
The community is broadly sympathetic to the general premise that data constraints are underappreciated relative to compute, but significant factions push back on the specifics. Many commenters find the article's core insight obvious or already well-understood in the ML community. Others argue the article misreads the Bitter Lesson by reframing it around data when the original essay is about general methods versus human knowledge engineering. The most engaged threads explore whether the data bottleneck is fundamental or merely a limitation of current architectures, with no clear consensus emerging.
In Agreement
- Data is the real bottleneck, not compute — we've essentially exhausted high-quality internet text and need new sources or methods to continue scaling
- Verifiable rewards and self-play (as in AlphaZero and DeepSeek R1) are the most promising path to generating new high-quality training data
- Real-world enterprise data is riddled with errors, making naive 'more data' approaches futile — data quality and curation matter more than quantity
- LLMs operate in a 'shadow world' of language that is fundamentally simpler than physical reality, supporting the view that data rather than compute is the limiting factor
- Embodiment and robotics face an even more severe data bottleneck, with exponentially more degrees of freedom and exponentially less training data available
- The distinction between Architect (better architectures) and Alchemist (better data generation) paths is a useful framing for corporate AI strategy
Opposed
- The Bitter Lesson never mentions data — it is specifically about general methods (search and learning) beating human-engineered knowledge, and the article fundamentally misunderstands it
- Data scarcity is not a fundamental problem but a limitation of the current paradigm — humans learn from far less data, suggesting architectural breakthroughs can overcome the wall
- Untapped data sources exist in abundance: video, audio, real-world sensor data, multimodal inputs, historical texts in non-English languages, and physical simulation
- The article's thesis is obvious to anyone following scaling laws research and adds little novel insight — it's unnecessarily sensationalist
- Synthetic data through RL with verifiable rewards already solves the data problem for domains like math and code, making the 'no second internet' framing misleading
- Inference-time scaling and post-training RL can unlock capabilities already latent in models without requiring more pre-training data