Autoresearch: Autonomous AI Agents for Self-Improving LLMs

Autoresearch is a framework that enables AI agents to autonomously experiment with and optimize LLM training code. By running fixed 5-minute training cycles, the agent iteratively improves model performance based on validation metrics while the human provides high-level instructions. This setup demonstrates a shift toward automated, agent-led machine learning research on single-GPU hardware.
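The propose-train-evaluate loop described above can be sketched as a simple greedy accept/reject cycle. This is a minimal illustration, not the project's actual code: `train_and_eval` stands in for a real 5-minute training run (here a toy objective), and `propose_edit` stands in for the agent's code modification; all names are hypothetical.

```python
import random

def train_and_eval(params):
    """Stand-in for one fixed-budget training run; returns a validation
    metric where lower is better. A toy quadratic objective replaces
    real training for illustration."""
    lr, wd = params["lr"], params["wd"]
    return (lr - 3e-4) ** 2 * 1e6 + (wd - 0.1) ** 2

def propose_edit(params, rng):
    """Stand-in for the agent's code edit: perturb one hyperparameter."""
    new = dict(params)
    key = rng.choice(sorted(new))
    new[key] *= rng.uniform(0.5, 2.0)
    return new

def research_loop(n_iters=50, seed=0):
    """Greedy loop: keep a candidate change only if the validation
    metric improves, mirroring the accept/reject cycle in the text."""
    rng = random.Random(seed)
    best = {"lr": 1e-3, "wd": 0.05}
    best_metric = train_and_eval(best)
    for _ in range(n_iters):
        cand = propose_edit(best, rng)
        metric = train_and_eval(cand)
        if metric < best_metric:
            best, best_metric = cand, metric
    return best, best_metric
```

Because candidates are accepted only on improvement, the metric is monotonically non-increasing across iterations, which is what makes a fixed time budget a fair comparison gate.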
Key Points
- The system uses an autonomous loop where an AI agent modifies code, trains for 5 minutes, and evaluates performance to iteratively improve a language model.
- Human researchers stop editing Python code directly and instead focus on 'programming' the agent's instructions via a Markdown file.
- A fixed 5-minute wall-clock time budget is used for all experiments to ensure architectural changes are compared fairly and optimized for the specific hardware.
- The project is intentionally kept simple with a single-file modification policy (train.py) to keep the agent's scope manageable and the diffs reviewable.
- The primary metric for success is validation bits per byte (val_bpb), which allows for vocab-size-independent comparisons of different model architectures.
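The val_bpb metric mentioned above normalizes the model's loss by the raw byte length of the text rather than by token count, so models with different tokenizers remain comparable. A minimal sketch of the conversion (the function name and inputs are illustrative, not the project's API):

```python
import math

def val_bpb(total_nll_nats, text):
    """Convert a summed token-level negative log-likelihood (in nats)
    into bits per byte of the underlying UTF-8 text.

    Dividing by bytes instead of tokens removes the tokenizer from the
    denominator: a model with a larger vocab emits fewer tokens, but
    the byte count of the text is fixed, so the comparison is fair.
    """
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / math.log(2) / n_bytes
```

For example, a model whose summed validation NLL is 100 nats over a 200-byte string scores 100 / ln(2) / 200 ≈ 0.72 bits per byte, regardless of how many tokens its vocabulary splits that string into.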
Sentiment
The community is moderately positive and intellectually engaged. Most commenters find the project conceptually compelling as a prototype for autonomous AI research, appreciate Karpathy's direct participation and candor about limitations, and see it as a useful canvas for experimentation. The main skepticism centers on whether current results represent genuine research versus glorified automation, not on the viability of the broader concept.
In Agreement
- Any human endeavor that can be objectively verified in a controlled environment will eventually be fully automatable using trial-and-error learning loops like this one.
- The approach offers real advantages over traditional hyperparameter tuning: unrestricted code modification, efficient binary search instead of parallel sweeps, and no human in the loop.
- The 'chief scientist' architecture — a planner that reads prior results and dispatches experiments to worker agents — is a promising extension that could produce more coherent research directions.
- AI agents using GitHub Discussions as a research publication medium is already viable; agents can post reports and other agents can read and build on them.
- Self-improvement loops are already being applied successfully in real projects, including autonomous software development harnesses and adversarial security testing.
Opposed
- The results shown are mostly hyperparameter adjustments, not novel algorithmic discoveries, raising the question of whether a BayesOpt sweep would perform similarly without the LLM.
- The random seed change (42 to 137) accepted as an 'improvement' suggests the agent may be overfitting to the evaluation set rather than finding principled advances.
- Even Karpathy acknowledges that current LLMs are 'cagey and scared' when given open-ended research problems, limiting the system's ability to pursue genuinely novel research directions.
- The results chart's y-axis does not start at zero, which obscures how modest the actual improvements are.
- The small model size of the test environment means the agent is optimizing for effects specific to that regime, which may not transfer to larger, more capable training runs.