Scaling Autoresearch: How 16 GPUs Transform AI-Driven Discovery

By giving Claude Code access to 16 GPUs via SkyPilot, researchers accelerated an autonomous neural network training loop by 9x. The agent completed over 900 experiments in 8 hours, discovering that model width was the most critical factor for performance. This parallel approach allowed the agent to move beyond simple trial-and-error to advanced factorial searches and autonomous hardware optimization.

Key Points

Parallelism increased experiment throughput by 9x, allowing the agent to reach its best validation loss in 8 hours, compared with a projected 72 hours for sequential runs.
The agent transitioned from simple sequential testing to complex factorial grid searches, enabling it to identify optimal model widths and hyperparameter interactions in single waves.
The AI agent autonomously developed a hardware-aware strategy, using H100s for broad screening and reserving the faster H200 GPUs to validate the most promising candidates.
Architecture discovery, specifically scaling model width, provided a more significant performance boost than hyperparameter tuning alone.
The total cost for the 8-hour session was approximately $300, including both AI API fees and GPU compute costs.
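
To make the "factorial search in a single wave" idea concrete, here is a minimal Python sketch. It is not the agent's actual code: the factor names, their levels, and the helper functions are hypothetical, chosen only so that the full grid happens to fit on 16 GPUs at once.

```python
from itertools import product


def factorial_grid(factors):
    """Return every combination of factor levels as a list of config dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*(factors[n] for n in names))]


def split_into_waves(configs, num_gpus):
    """Chunk configs into waves of at most num_gpus concurrent runs."""
    return [configs[i:i + num_gpus] for i in range(0, len(configs), num_gpus)]


# Hypothetical factor levels, for illustration only (not taken from the source).
FACTORS = {
    "width": [256, 512, 768, 1024],
    "learning_rate": [1e-3, 3e-3],
    "weight_decay": [0.0, 0.1],
}

configs = factorial_grid(FACTORS)               # 4 * 2 * 2 = 16 combinations
waves = split_into_waves(configs, num_gpus=16)  # one run per GPU per wave
print(len(configs), "configs in", len(waves), "wave(s)")
```

Because every combination runs in the same wave, interactions between factors (for example, whether a wider model prefers a different learning rate) show up in one round of results instead of being pieced together from many sequential runs.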

Sentiment

The community is notably divided. There is genuine technical interest in agent-driven code exploration beyond simple hyperparameter tuning, bolstered by Karpathy's direct participation and concrete examples. However, many are skeptical that current implementations deliver on the research promise rather than amounting to glorified hyperparameter search. The efficiency criticism, that parallelization merely trades money for time, resonates strongly, as does frustration with perceived hype around the concept.

In Agreement

Karpathy argues the agent can modify code arbitrarily and implement novel architecture changes like the smear gate and backout skip connection, which is fundamentally different from hyperparameter tuning.
An NVIDIA researcher corroborates the value with their own use of agent-driven exploration for deep code changes in training sparse autoencoders.
The emergent H100/H200 tiered screening strategy demonstrates the agent's ability to reason about resource allocation without explicit instruction.
Recent LLM achievements in mathematics demonstrate genuine capacity for novel reasoning beyond training data regurgitation.
Giving agents access to literature like arXiv could significantly diversify the methods they explore.
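
The tiered screening strategy mentioned above can be sketched roughly as follows. This is an illustrative assumption about how such a two-stage scheme might look, not the agent's implementation: `tiered_screen`, the two toy loss functions, and the candidate widths are all hypothetical stand-ins for real training runs on the H100 and H200 pools.

```python
def tiered_screen(configs, screen_fn, validate_fn, top_k):
    """Screen every config cheaply, then re-evaluate only the top_k finalists.

    Lower loss is better. Returns (best_validated_loss, best_config).
    """
    finalists = sorted(configs, key=screen_fn)[:top_k]
    validated = [(validate_fn(cfg), cfg) for cfg in finalists]
    return min(validated, key=lambda pair: pair[0])


# Toy stand-ins for real runs (hypothetical): the screening estimate is
# deliberately slightly off, as a short run on the broad pool might be.
def toy_screen_loss(cfg):
    return 1.0 + abs(cfg["width"] - 700) / 1000


def toy_validate_loss(cfg):
    return 1.0 + abs(cfg["width"] - 768) / 1000


candidates = [{"width": w} for w in (256, 512, 768, 1024)]
best_loss, best_cfg = tiered_screen(candidates, toy_screen_loss, toy_validate_loss, top_k=2)
print(best_cfg, best_loss)
```

The design point is that the expensive, faster hardware is spent only on the few candidates that survive the cheap pass.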

Opposed

In practice, the vast majority of changes autoresearch makes could be found faster with Bayesian optimization if properly parameterized.
The 16-GPU setup is roughly half as efficient in GPU-hours as the single-GPU setup; the blog oversells what is essentially trading money for wall-clock time.
The concept is not novel: people have been running similar agent-in-a-loop optimization for over a year, and Karpathy's version gets disproportionate attention due to his following.
Short training runs risk optimizing for early velocity rather than long-term asymptotic performance.
LLMs may be creating an illusion of research progress through information abundance rather than genuine substance.
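
The "roughly half as efficient" criticism follows from the figures quoted in this summary. The short calculation below works it through, assuming the projected 72-hour sequential baseline occupies a single GPU and that all 16 GPUs were held for the full 8 hours (it ignores any difference between H100 and H200 hours). The function name is illustrative.

```python
def parallel_efficiency(sequential_hours, parallel_hours, num_gpus):
    """Return (speedup, efficiency), where efficiency = speedup / num_gpus."""
    speedup = sequential_hours / parallel_hours
    return speedup, speedup / num_gpus


# Figures from the summary: 72 h projected sequentially, 8 h on 16 GPUs.
speedup, efficiency = parallel_efficiency(72, 8, 16)

gpu_hours_parallel = 16 * 8    # 128 GPU-hours consumed by the parallel run
gpu_hours_sequential = 1 * 72  # 72 GPU-hours for the assumed single-GPU baseline

print(f"speedup: {speedup:.0f}x, efficiency: {efficiency:.1%}")
print(f"GPU-hours: {gpu_hours_parallel} parallel vs {gpu_hours_sequential} sequential")
```

Under these assumptions the parallel run is 9x faster in wall-clock time but uses 128 GPU-hours against 72, an efficiency of about 56 percent, which matches the "trading money for time" framing.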