Unlocking AI’s Data: ABC and an ARPANET-Style Plan

AI’s bottleneck is not a lack of data but a lack of access: frontier models train on hundreds of terabytes while the world holds zettabytes of high-quality, private data. The authors propose Attribution-Based Control (ABC), which lets data owners govern and monetize per-prediction use and lets users select sources, enabled by model partitioning and privacy-enhancing technologies. They urge an ARPANET-style U.S. program (DARPA, NSF, and NIST) to integrate, deploy, and standardize ABC to unlock orders of magnitude more training data.
Key Points
- AI’s biggest breakthroughs have tracked surges in available, high-quality data; current frontier models use only hundreds of terabytes while the world holds zettabytes—access, not scarcity, is the binding constraint.
- Information markets fail for data: copying destroys control and price, and infinite reuse makes buyers competitors, so owners hoard data and build moats.
- Attribution-Based Control (ABC) realigns incentives by letting data owners control which predictions their data supports and letting users choose sources, enabling ongoing, per-use monetization without surrendering control.
- Model partitioning (MoE, RAG, RETRO/ATLAS, model merging, federated MoE) plus privacy-enhancing tech (confidential GPU enclaves, homomorphic encryption, federated learning, MPC, ZK proofs, differential privacy) make ABC practically achievable with modest overhead.
- An ARPANET-style federal program (DARPA to build and test ABC systems, NSF to fund early institutional adoption, and NIST to set global standards) could unlock a million times more data for AI while preserving rights and privacy.
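To make the model-partitioning idea concrete, here is a toy sketch (not from the article, all names hypothetical) of how per-prediction attribution and owner control could work: each "expert" stands in for a sub-model trained on one owner's data, predictions only use experts whose owners currently permit it, and every prediction logs which sources contributed so they can be compensated per use.

```python
# Toy ABC-style partitioned model: owner-controlled experts + attribution log.
# Illustrative only; real systems would use MoE/RAG components, not scalars.
from dataclasses import dataclass, field

@dataclass
class Expert:
    owner: str
    weight: float           # stand-in for a trained sub-model
    permitted: bool = True  # owner-side control switch

@dataclass
class PartitionedModel:
    experts: list
    usage_log: dict = field(default_factory=dict)  # owner -> times used

    def predict(self, x: float, selected_owners=None) -> float:
        """Combine permitted (and user-selected) experts; log attribution."""
        active = [e for e in self.experts
                  if e.permitted
                  and (selected_owners is None or e.owner in selected_owners)]
        if not active:
            raise ValueError("no permitted data sources for this prediction")
        for e in active:  # record per-prediction attribution for billing
            self.usage_log[e.owner] = self.usage_log.get(e.owner, 0) + 1
        return sum(e.weight * x for e in active) / len(active)

model = PartitionedModel([Expert("hospital_a", 2.0),
                          Expert("bank_b", 4.0)])
y1 = model.predict(3.0)             # both owners contribute -> 9.0
model.experts[1].permitted = False  # bank_b revokes access
y2 = model.predict(3.0)             # only hospital_a contributes -> 6.0
```

The point of the sketch is the incentive structure, not the math: revocation takes effect on the next prediction without retraining, and the usage log gives owners a per-use metering basis.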
Sentiment
The Hacker News community is predominantly skeptical and dismissive. While the article's author engages respectfully and constructively with nearly every comment, commenters largely reject both the premise that AI needs vastly more data and the practicality of the proposed solutions. The discussion reflects deep distrust of AI companies' intentions regarding private data, and most see the ABC framework as either technically infeasible or practically naive.
In Agreement
- Private data such as health records and financial transactions could be uniquely valuable for AI if access problems were solved, particularly for specialized applications like medical AI
- A data exchange or commodity market for training data is an interesting concept worth exploring
- The question of how much quality data remains untapped is genuinely important for AI's future
- Homomorphic addition for federated averaging is technically performant, and companies like Zama have working products in this space
- Attribution-based control, if achievable, would be a step in the right direction for data rights enforcement
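The federated-averaging point above can be illustrated with a minimal Paillier cryptosystem, whose additive homomorphism is the property commenters cite: multiplying ciphertexts yields an encryption of the sum, so a server can aggregate client updates without seeing any individual one. This is a textbook sketch with toy parameters, not a reference to any particular product's implementation.

```python
# Minimal Paillier encryption (toy primes; real deployments need ~1024-bit
# primes and vector/fixed-point encodings, with far more care throughout).
import random
from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

p, q = 293, 433
n = p * q
n2 = n * n
g = n + 1
lam = lcm(p - 1, q - 1)
# mu = (L(g^lam mod n^2))^-1 mod n, where L(x) = (x - 1) // n
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)

def enc(m: int) -> int:
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c: int) -> int:
    return ((pow(c, lam, n2) - 1) // n * mu) % n

# Federated averaging: clients encrypt scalar "updates"; the server
# multiplies ciphertexts, which adds the plaintexts under encryption.
updates = [5, 17, 8]
ciphertexts = [enc(u) for u in updates]
agg = 1
for c in ciphertexts:
    agg = (agg * c) % n2
total = dec(agg)            # 30, without the server seeing any single update
avg = total / len(updates)  # 10.0
```

Because the server only ever decrypts the aggregate, this supports the "technically performant" claim for addition specifically; the skeptics' overhead objection targets fully homomorphic schemes that must evaluate arbitrary circuits, not this additive special case.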
Opposed
- More data doesn't automatically improve models — the industry is moving toward synthetic data, reasoning, and reinforcement learning rather than bigger pretraining datasets
- The million-times-more-data framing is misleading since most of the world's data is irrelevant for LLM training (videos, games, sensor readings, duplicates)
- The ABC framework is impractical, reminiscent of the failed Semantic Web vision of labeling all the world's data with rich metadata
- AI companies currently fighting to avoid paying for scraped data cannot be trusted to respect private data rights under any framework
- Homomorphic encryption adds massive computational overhead and is unproven for GPU-based ML at scale
- Private data isn't inherently high-quality — corporate scandals like Enron, Theranos, and FTX demonstrate that real-world grounding doesn't guarantee accuracy
- The article reads like think-tank buzzword nonsense targeting a policy audience and lacking technical rigor
- The proposal essentially enables companies to access private medical and personal data under a thin privacy veneer