ARC-AGI-3: Measuring Human-Like Learning in AI Agents

ARC-AGI-3 is an interactive benchmark that tests AI agents' ability to learn and adapt in novel environments without pre-loaded knowledge. It measures intelligence based on skill-acquisition efficiency and planning horizons, mirroring human reasoning capabilities. The platform aims to quantify and eventually close the gap between current AI performance and human-level general intelligence.

Key Points

ARC-AGI-3 shifts from static puzzle-solving to interactive learning within novel environments.
It measures intelligence through skill-acquisition efficiency and long-horizon planning rather than just final outputs.
The benchmark is designed to be 100% human-solvable while remaining resistant to AI brute-force memorization.
It provides a developer toolkit and replay system to transparently evaluate and iterate on AI agent reasoning.

Sentiment

The community is deeply divided. A significant faction defends the benchmark's methodology as a necessary and well-designed test for genuine general intelligence, while an equally vocal faction argues the scoring is misleading, the tool restrictions are arbitrary, and the benchmark lacks practical predictive value. The presence of François Chollet in the discussion adds credibility but also draws sharper criticism from skeptics. The debate frequently drifts from the benchmark itself into fundamental disagreements about what AGI means and whether current models qualify.

In Agreement

The scoring methodology is defensible because it prevents brute-force approaches and tests genuine learning efficiency, which is the core of what general intelligence means.
Using the second-best human as the baseline rather than the average is appropriate because AGI should aspire to high human capability, not just median performance.
Requiring models to work without specialized harnesses is the right approach because a truly generally intelligent agent should be able to figure out what tools it needs on its own.
ARC-AGI-3 correctly tests a critical capability gap, in-context learning and world-model building, that current frontier models genuinely lack.
Step efficiency matters because real-world actions have costs and externalities, so brute-forcing solutions with unlimited compute is not equivalent to intelligent behavior.
The benchmark is useful precisely because frontier models score so poorly, highlighting a genuine and important gap between AI and human learning.

Opposed

The scoring is so convoluted and punitive that it obscures what models can actually do: a score of 5% could mean anything from solving almost nothing to solving everything inefficiently.
Not allowing basic agent tools like code execution is an arbitrary and unfair constraint that handicaps models without testing intelligence, since humans also use tools to solve problems.
Models given visual input or basic tools (the Duke harness) score dramatically higher, suggesting the benchmark tests input format limitations more than intelligence.
The human baseline is misleading because even average humans would score below 25% on this scale, making the gap between humans and AI appear larger than it actually is.
ARC-AGI has zero practical utility and no predictive power for real-world capabilities, unlike benchmarks such as FrontierMath, which correlate with actual research contributions.
The benchmark seems designed to stay unsaturated so ARC can claim credit when the 'continual learning breakthrough' eventually happens, regardless of their role in it.
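
The first objection above, that a low score is ambiguous, can be illustrated with a small sketch. The formula here is an assumption made purely for illustration and is not ARC-AGI-3's actual scoring rule: each level scores 0 if unsolved, otherwise the ratio of a human baseline's step count to the agent's step count, capped at 1, and the benchmark score is the mean over levels. The names `LevelResult`, `level_score`, and `benchmark_score`, and all step counts, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LevelResult:
    solved: bool
    agent_steps: int     # actions the agent took on this level (made-up values)
    baseline_steps: int  # actions the human reference took (made-up values)


def level_score(result: LevelResult) -> float:
    """Hypothetical efficiency score: 0 if unsolved, else baseline/agent steps, capped at 1."""
    if not result.solved or result.agent_steps <= 0:
        return 0.0
    return min(1.0, result.baseline_steps / result.agent_steps)


def benchmark_score(results: list[LevelResult]) -> float:
    """Mean of per-level scores (assumed aggregation, not the official one)."""
    if not results:
        return 0.0
    return sum(level_score(r) for r in results) / len(results)


# Agent A solves 1 of 20 levels at human efficiency and fails the rest.
agent_a = [LevelResult(True, 10, 10)] + [LevelResult(False, 500, 10) for _ in range(19)]

# Agent B solves all 20 levels but uses 20x the baseline number of steps.
agent_b = [LevelResult(True, 200, 10) for _ in range(20)]

print(f"Agent A: {benchmark_score(agent_a):.1%}")
print(f"Agent B: {benchmark_score(agent_b):.1%}")
```

Under this assumed formula both agents land at about 5%, even though one solved a single level and the other solved every level inefficiently. Reporting the solve rate alongside the efficiency-weighted score would separate the two cases.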