Evolving English Instructions Sets New ARC SoTA and Points to RL for AGI

Added Sep 17, 2025
Article: Positive | Community: Neutral/Divisive

Berman sets new records on ARC-AGI by swapping Python solutions for English instructions within an evolutionary, multi-agent test-time compute framework. Grok-4 generates, tests, and evolves instructions through individual and pooled revisions, yielding strong accuracy gains and much lower costs. He argues ARC exposes dead reasoning zones in LLMs and that RL over reasoning is needed to achieve domain-agnostic, human-like generalization.

Key Points

  • New SoTA: 79.6% on ARC v1 at $8.42 per task and 29.4% on ARC v2, using Grok-4 with evolutionary test-time compute that evolves English instructions.
  • Architecture: generate, score, and evolve natural language instructions via individual and pooled revisions, capped at roughly 40 attempts per task to balance exploration and compute.
  • Why English over Python: ARC v2 tasks require nuanced, context-rich transformations that are fragile or overly complex in code but expressible in plain language.
  • Limits and trade-offs: pooled revisions help, but too many parents cause context bloat and degraded reasoning; small pools and staged refinement work best.
  • Position on AGI: current LLMs have dead reasoning zones and fragmented, domain-tied reasoning; RL over reasoning is needed to bring consistent, transferable reasoning in-distribution, which he claims is the route to AGI.
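The generate/score/evolve loop described in the bullets above can be sketched as a small search driver. This is a minimal illustration, not the author's code: `generate`, `revise_one`, `revise_pool`, and `score` are hypothetical callables that, in the real system, would wrap Grok-4 prompts and a check against the task's training pairs.

```python
def evolve_instructions(generate, revise_one, revise_pool, score,
                        max_attempts=40, pool_size=3):
    """Evolve natural-language instructions under a fixed attempt budget.

    generate()        -> a fresh candidate instruction (str)
    revise_one(c)     -> revision of a single parent candidate
    revise_pool(ps)   -> revision synthesized from a small pool of parents
    score(c)          -> fitness in [0, 1]; 1.0 means all training pairs solved
    """
    population = []  # list of (score, candidate), best first
    attempts = 0
    while attempts < max_attempts:
        if not population:
            candidate = generate()                    # seed the population
        elif attempts % 2:
            candidate = revise_one(population[0][1])  # individual revision
        else:
            parents = [c for _, c in population[:pool_size]]
            candidate = revise_pool(parents)          # pooled revision
        attempts += 1
        s = score(candidate)
        population.append((s, candidate))
        population.sort(key=lambda pair: -pair[0])
        if s == 1.0:                                  # early exit once solved
            break
    return population[0]
```

Keeping `pool_size` small reflects the reported trade-off: too many parents bloat the context and degrade the model's reasoning during revision.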

Sentiment

The HN community is predominantly skeptical about the article's broader AGI claims while showing moderate interest in the specific technical approach. The majority of discussion energy goes toward debating whether LLMs reason at all, with skeptics slightly outnumbering defenders. Most commenters accept that LLMs are useful tools but reject the framing that benchmark improvements point toward AGI. There is widespread agreement that spatial reasoning is a genuine LLM weakness and that ARC-AGI specifically targets it.

In Agreement

  • LLMs demonstrate practical capabilities equivalent to reasoning when they can apply learned techniques to novel situations, regardless of the underlying mechanism
  • The pattern-matching-versus-reasoning distinction is meaningless without a rigorous definition of reasoning; per the psychological literature, humans also reason by pattern matching
  • ARC-AGI failures reflect LLMs' specific weakness in spatial reasoning rather than a fundamental absence of intelligence
  • Evolutionary English instruction approaches are genuinely promising and resemble AlphaEvolve-style search over natural language
  • Newer reasoning models have demonstrably overcome earlier failures like modified riddles, suggesting genuine improvement
  • Mechanistic interpretability research has found latent world models and abstract reasoning circuits inside LLMs, suggesting something deeper than surface pattern matching

Opposed

  • LLMs are not reasoning — they are pattern matching on training data and fail on out-of-distribution tasks, making them sophisticated expert systems
  • LLMs cannot learn at runtime, which is a fundamental requirement for genuine intelligence; they can apply rules but cannot discover new ones through experimentation
  • ARC-AGI results do not indicate progress toward AGI; this is just slightly smarter brute forcing on a contrived puzzle
  • LLMs' memory limitations are crippling for agentic tasks — they forget their own correct reasoning between turns and cannot maintain abstract concepts over multi-step problems
  • Text-based LLMs are fundamentally handicapped on visual and spatial tasks, and ARC-AGI specifically exploits this weakness
  • The method uses many attempts with an external oracle — this is trial and error, not reasoning; humans solve these in one shot
  • Different AI architectures may be needed for spatial reasoning; transformer models trained on 1D text streams are inherently limited