Unlocking AI’s Data: ABC and an ARPANET-Style Plan

AI’s bottleneck is not a lack of data but a lack of access: frontier models train on hundreds of terabytes while the world holds zettabytes of high-quality, private data. The authors propose Attribution-Based Control (ABC), which lets data owners govern and monetize the per-prediction use of their data and lets users select sources, enabled by model partitioning and privacy-enhancing technologies. They urge an ARPANET-style U.S. program, run through DARPA, NSF, and NIST, to integrate, deploy, and standardize ABC and so unlock orders of magnitude more training data.
Key Points
- AI’s biggest breakthroughs have tracked surges in available, high-quality data; current frontier models use only hundreds of terabytes while the world holds zettabytes—access, not scarcity, is the binding constraint.
- Information markets fail for data: copying destroys control and price, and infinite reuse makes buyers competitors, so owners hoard data and build moats.
- Attribution-Based Control (ABC) realigns incentives by letting data owners control which predictions their data supports and letting users choose sources, enabling ongoing, per-use monetization without surrendering control.
- Model partitioning (MoE, RAG, RETRO/ATLAS, model merging, federated MoE) plus privacy-enhancing tech (confidential GPU enclaves, homomorphic encryption, federated learning, MPC, ZK proofs, differential privacy) make ABC practically achievable with modest overhead.
- An ARPANET-style federal program (DARPA to build and test ABC systems, NSF to fund early institutional adoption, and NIST to set global standards) can unlock a million times more data for AI while preserving rights and privacy.
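
The two-sided control described in the ABC point above can be illustrated with a toy sketch. It is not the authors' design: the `Source` class, the `predict` function, the purpose labels, and the fee values are all hypothetical, and a real system would sit on partitioned models and privacy-enhancing technologies rather than a plain Python loop. The sketch only shows the gating rule: a source contributes to a prediction when the user selects it and its owner permits that use, and each contribution is metered.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """A hypothetical data owner's partition (e.g. one expert or one retrieval index)."""
    name: str
    price_per_use: float                       # owner-set fee per prediction (arbitrary units)
    allowed_purposes: set[str] = field(default_factory=set)
    earned: float = 0.0                        # running total credited to this owner

    def permits(self, purpose: str) -> bool:
        # Owner-side control: only listed purposes may draw on this source.
        return purpose in self.allowed_purposes


def predict(sources, selected, purpose):
    """Return the names of the sources that contribute to one prediction.

    A source contributes only if the user selected it AND its owner permits
    the stated purpose. Each contributing source is credited its per-use fee.
    """
    used = []
    for src in sources:
        if src.name in selected and src.permits(purpose):
            src.earned += src.price_per_use
            used.append(src.name)
    return used


# Illustrative usage with made-up sources and purposes.
hospital = Source("hospital_ehr", 0.02, {"clinical_research"})
bank = Source("bank_txn", 0.05, {"fraud_detection"})
used = predict([hospital, bank], {"hospital_ehr", "bank_txn"}, "clinical_research")
print(used, hospital.earned, bank.earned)
```

Here the user selects both sources, but only the hospital's owner permits the stated purpose, so only that source is used and credited.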
Sentiment
The Hacker News discussion is highly skeptical and critical of both the article's proposed solution and its underlying motivations. Commenters implicitly recognize the value of private data, but they meet the plan to unlock it for AI with strong privacy concerns and dismiss the proposed technology.
In Agreement
- The value and unique quality of locked private data (e.g., electronic health records, financial transactions, industrial sensor readings) for specialized AI applications is implicitly acknowledged, aligning with the article's premise that such data is highly desirable for AI.
- The potential for a market or exchange for this valuable data is considered, suggesting an acceptance of the idea that this data could be accessed and monetized in some structured way.
Opposed
- The article's proposal is viewed skeptically as a think-tank's effort to enable companies to gain access to sensitive private medical and other personal data.
- The proposed "solution" for privacy (Attribution-Based Control) is dismissed as resembling an overhyped and potentially ineffective "blockchain pitch circa-2019."