Autoresearch: Autonomous AI Agents for Self-Improving LLMs

Autoresearch is a framework that enables AI agents to autonomously experiment with and optimize LLM training code. By running fixed 5-minute training cycles, the agent iteratively improves model performance based on validation metrics while the human provides high-level instructions. This setup demonstrates a shift toward automated, agent-led machine learning research on single-GPU hardware.
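The propose-train-evaluate loop described above can be sketched as a simple greedy accept/reject cycle. This is a minimal illustration, not the project's actual code: `train_and_eval` stands in for a real 5-minute training run (here a toy objective), and `propose_edit` stands in for the agent's code modification; all names are hypothetical.

```python
import random

def train_and_eval(params):
    """Stand-in for one fixed-budget training run; returns a validation
    metric where lower is better. A toy quadratic objective replaces
    real training for illustration."""
    lr, wd = params["lr"], params["wd"]
    return (lr - 3e-4) ** 2 * 1e6 + (wd - 0.1) ** 2

def propose_edit(params, rng):
    """Stand-in for the agent's code edit: perturb one hyperparameter."""
    new = dict(params)
    key = rng.choice(sorted(new))
    new[key] *= rng.uniform(0.5, 2.0)
    return new

def research_loop(n_iters=50, seed=0):
    """Greedy loop: keep a candidate change only if the validation
    metric improves, mirroring the accept/reject cycle in the text."""
    rng = random.Random(seed)
    best = {"lr": 1e-3, "wd": 0.05}
    best_metric = train_and_eval(best)
    for _ in range(n_iters):
        cand = propose_edit(best, rng)
        metric = train_and_eval(cand)
        if metric < best_metric:
            best, best_metric = cand, metric
    return best, best_metric
```

Because candidates are accepted only on improvement, the metric is monotonically non-increasing across iterations, which is what makes a fixed time budget a fair comparison gate.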
Key Points
- The system uses an autonomous loop where an AI agent modifies code, trains for 5 minutes, and evaluates performance to iteratively improve a language model.
- Human researchers stop editing Python code directly and instead focus on 'programming' the agent's instructions via a Markdown file.
- A fixed 5-minute wall-clock time budget is used for all experiments to ensure architectural changes are compared fairly and optimized for the specific hardware.
- The project is intentionally kept simple with a single-file modification policy (train.py) to keep the agent's scope manageable and the diffs reviewable.
- The primary metric for success is validation bits per byte (val_bpb), which allows for vocab-size-independent comparisons of different model architectures.
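The val_bpb metric mentioned above normalizes the model's loss by the raw byte length of the text rather than by token count, so models with different tokenizers remain comparable. A minimal sketch of the conversion (the function name and inputs are illustrative, not the project's API):

```python
import math

def val_bpb(total_nll_nats, text):
    """Convert a summed token-level negative log-likelihood (in nats)
    into bits per byte of the underlying UTF-8 text.

    Dividing by bytes instead of tokens removes the tokenizer from the
    denominator: a model with a larger vocab emits fewer tokens, but
    the byte count of the text is fixed, so the comparison is fair.
    """
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / math.log(2) / n_bytes
```

For example, a model whose summed validation NLL is 100 nats over a 200-byte string scores 100 / ln(2) / 200 ≈ 0.72 bits per byte, regardless of how many tokens its vocabulary splits that string into.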
Sentiment
The community is moderately positive and intellectually engaged. Most commenters find the project conceptually compelling as a prototype for autonomous AI research, appreciate Karpathy's direct participation and candor about limitations, and see it as a useful canvas for experimentation. The main skepticism centers on whether current results represent genuine research versus glorified automation, not on the viability of the broader concept.
In Agreement
- Any human endeavor that can be objectively verified in a controlled environment will eventually be fully automatable using trial-and-error learning loops like this one.
- The approach offers real advantages over traditional hyperparameter tuning: unrestricted code modification, efficient binary search instead of parallel sweeps, and no human in the loop.
- The 'chief scientist' architecture — a planner that reads prior results and dispatches experiments to worker agents — is a promising extension that could produce more coherent research directions.
- AI agents using GitHub Discussions as a research publication medium is already viable; agents can post reports and other agents can read and build on them.
- Self-improvement loops are already being applied successfully in real projects, including autonomous software development harnesses and adversarial security testing.
Opposed
- The results shown are mostly hyperparameter adjustments, not novel algorithmic discoveries, raising the question of whether a BayesOpt sweep would perform similarly without the LLM.
- The random seed change (42 to 137) accepted as an 'improvement' suggests the agent may be overfitting to the evaluation set rather than finding principled advances.
- Even Karpathy acknowledges that current LLMs are 'cagey and scared' when given open-ended research problems, limiting the system's ability to pursue genuinely novel research directions.
- The results chart's y-axis does not start at zero, which obscures how modest the actual improvements are.
- The small model size of the test environment means the agent is optimizing for effects specific to that regime, which may not transfer to larger, more capable training runs.