Evolving English Instructions Sets New ARC SoTA and Points to RL for AGI

Added Sep 17, 2025

Berman sets new records on ARC-AGI by swapping Python solutions for English instructions within an evolutionary, multi-agent test-time compute framework. Grok-4 generates, tests, and evolves instructions through individual and pooled revisions, yielding strong accuracy gains and much lower costs. He argues ARC exposes dead reasoning zones in LLMs and that RL over reasoning is needed to achieve domain-agnostic, human-like generalization.

Key Points

  • New SoTA: 79.6% on ARC v1 at $8.42 per task and 29.4% on ARC v2, using Grok-4 with evolutionary test-time compute that evolves English instructions.
  • Architecture: generate, score, and evolve natural language instructions via individual and pooled revisions, capped at roughly 40 attempts per task to balance exploration and compute.
  • Why English over Python: ARC v2 tasks require nuanced, context-rich transformations that are fragile or overly complex in code but expressible in plain language.
  • Limits and trade-offs: pooled revisions help, but too many parents cause context bloat and degraded reasoning; small pools and staged refinement work best.
  • Position on AGI: current LLMs have dead reasoning zones and fragmented, domain-tied reasoning; RL over reasoning is needed to bring consistent, transferable reasoning in-distribution, which he claims is the route to AGI.
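The evolutionary loop described above can be sketched in miniature. This is a hypothetical illustration, not Berman's pipeline: real candidates would be English instructions generated and revised by Grok-4 and scored by applying them to ARC training pairs; here a candidate is just a number and fitness is closeness to a target, so the generate–score–revise structure (small parent pools, a ~40-attempt budget) is runnable without an LLM.

```python
import random

random.seed(0)  # deterministic for illustration

TARGET = 42.0  # stand-in for "solves the ARC training pairs"

def generate_candidate():
    # Real system: prompt the model for fresh English instructions.
    return random.uniform(0, 100)

def score(candidate):
    # Real system: execute the instructions on training examples and
    # count correct output grids. Here: negative distance to target.
    return -abs(candidate - TARGET)

def revise(parents):
    # Pooled revision: combine a small pool of top parents. The pool is
    # kept small because, per the article, too many parents bloat the
    # context and degrade the model's reasoning.
    return sum(parents) / len(parents) + random.uniform(-5, 5)

def evolve(budget=40, pool_size=3, seed_count=5):
    # Seed the population, then spend the rest of the attempt budget
    # on pooled revisions of the best candidates so far.
    population = [generate_candidate() for _ in range(seed_count)]
    attempts = len(population)
    while attempts < budget:
        population.sort(key=score, reverse=True)
        population.append(revise(population[:pool_size]))
        attempts += 1
    return max(population, key=score)

best = evolve()
```

The fixed budget mirrors the roughly-40-attempts-per-task cap: exploration stops once the attempt count is exhausted, trading possible further gains for bounded cost per task.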

Sentiment

Mixed, with respect for the engineering and cost/performance gains but overall skepticism about claims regarding true reasoning and AGI implications.

In Agreement

  • Using natural language as the search/program space is powerful for ARC v2; English handles nuanced, context-heavy pattern descriptions better than brittle Python code.
  • Evolutionary test-time compute with generate–execute–score–revise loops is an effective, pragmatic way to push performance and reduce cost per task.
  • Whether you call it reasoning or pattern matching, useful performance and generalization on real tasks (e.g., complex concurrent code tests) matter more than terminology.
  • Humans also rely heavily on pattern application; much knowledge work is applying known techniques to new contexts, which LLMs can already do.
  • Folding scaffolding into models (plans/ASTs, internal memory, self-revision) is a promising path; auto-synthesized scaffolding can scale capability.
  • ARC stresses spatial reasoning, an area where LLMs are weak; the method is a reasonable way to navigate that weakness while models and vision frontends catch up.
  • Sharing code and a reproducible pipeline is valuable for the community; the approach could be adapted to other domains.

Opposed

  • This looks like guided brute force with an oracle (ground-truth checker), not humanlike one-shot reasoning; many attempts plus feedback inflate success.
  • ARC is a contrived benchmark; improving it doesn’t prove general reasoning or AGI, and may reflect overfitting to the test.
  • LLMs still lack runtime learning and robust memory; they cannot retain or reuse discovered rules across steps without brittle scaffolding.
  • LLMs are notably poor at spatial reasoning and vision; current multimodal frontends are weak, so gains here say little about broader intelligence.
  • Labeling LLM failures as ‘dead reasoning zones’ and claiming RL enforces logical consistency is overstated; RL optimizes rewards, not formal logic.
  • Cost/energy and out-of-domain fragility suggest LLMs imitate reasoning rather than possess generalizable cognition.
  • Potential training contamination (models seeing ARC solutions) undermines claims of novelty and genuine generalization.