Automating ML Research: Claude Code vs. the eCLIP Optimization Loop

The author used Claude Code to implement an autonomous research loop that optimized an eCLIP model's performance on a dataset of Japanese woodblock prints. Over 42 automated experiments, the agent reduced the model's Mean Rank (lower is better) by 54% through bug fixes and hyperparameter tuning. The agent was efficient on structured tasks but reached its limits when attempting complex architectural changes, suggesting that LLMs are best suited to the 'first 90%' of research optimization.
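For readers unfamiliar with the metric, here is a minimal sketch of how a Mean Rank and its relative reduction could be computed. It assumes a retrieval setup in which query i should match candidate i; the function names, the matrix layout, and the toy scores are illustrative assumptions and are not taken from the author's code.

```python
from typing import List, Sequence


def mean_rank(similarity: Sequence[Sequence[float]]) -> float:
    """Average 1-based rank of the correct candidate for each query.

    Assumption: similarity[i][j] scores query i against candidate j,
    and candidate i is the correct match for query i.
    """
    ranks: List[int] = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Rank is 1 plus the number of candidates scoring strictly higher.
        ranks.append(1 + sum(1 for score in row if score > target))
    return sum(ranks) / len(ranks)


def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction of a lower-is-better metric."""
    return (before - after) / before


# Toy scores, made up purely for illustration.
toy_similarity = [
    [0.9, 0.1, 0.3],  # correct candidate ranked 1st
    [0.8, 0.5, 0.6],  # correct candidate ranked 3rd
    [0.2, 0.7, 0.4],  # correct candidate ranked 2nd
]
toy_mean_rank = mean_rank(toy_similarity)
```

Under this definition, a 54% reduction means the correct item sits, on average, at less than half its original position in the ranked list.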

Key Points

The Autoresearch framework uses an LLM agent to drive a tight loop of hypothesis, code modification, and evaluation to optimize ML models.
The agent was highly effective at 'low-hanging fruit,' such as identifying code bugs and performing methodical hyperparameter optimization.
Sandboxing via containerization and restricted permissions is essential for safely allowing LLM agents to execute code on a workstation.
The experiment showed diminishing returns: the agent excelled at structured optimization but struggled with creative architectural innovations and 'unknown unknowns.'
A 54% improvement in the evaluation metric was achieved autonomously in one day, demonstrating the efficiency of LLMs for the 'tedious' parts of research.
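The hypothesis, modification, and evaluation loop described above can be sketched roughly as follows. This is not the Autoresearch implementation: `research_loop`, `toy_evaluate`, and the random proposer are hypothetical stand-ins, with a seeded random perturbation taking the place of the LLM agent and a toy function taking the place of a real training run.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]
History = List[Tuple[Config, float]]


def research_loop(
    baseline: Config,
    propose: Callable[[Config, History], Config],
    evaluate: Callable[[Config], float],
    budget: int,
) -> Tuple[Config, float, History]:
    """Greedy loop: propose a change, evaluate it, keep it only if it helps."""
    best = dict(baseline)
    best_score = evaluate(best)
    history: History = [(dict(best), best_score)]
    for _ in range(budget):
        candidate = propose(best, history)    # hypothesis + modification
        score = evaluate(candidate)           # evaluation
        history.append((dict(candidate), score))
        if score < best_score:                # lower is better, like Mean Rank
            best, best_score = dict(candidate), score
    return best, best_score, history


def toy_evaluate(config: Config) -> float:
    """Stand-in for train-and-evaluate; minimized at lr = 1e-3."""
    return (math.log10(config["lr"]) + 3.0) ** 2 + 1.0


def make_random_proposer(seed: int) -> Callable[[Config, History], Config]:
    """Stand-in for the agent: perturbs the best learning rate at random."""
    rng = random.Random(seed)

    def propose(best: Config, history: History) -> Config:
        candidate = dict(best)
        candidate["lr"] = best["lr"] * 10 ** rng.uniform(-0.5, 0.5)
        return candidate

    return propose


# A budget of 42 mirrors the number of experiments reported in the article.
best, best_score, history = research_loop(
    {"lr": 1e-1}, make_random_proposer(0), toy_evaluate, budget=42
)
```

The keep-only-if-better rule makes the loop easy to audit, but it also shows why such a loop stays within a fixed box: it can only accept changes that the chosen metric rewards.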

Sentiment

The community is cautiously positive about autoresearch as a practical tool while being skeptical of its broader implications. Most commenters appreciate the honest reporting of both successes and limitations, but push back on framing it as "research" rather than optimization. The dominant view is that the approach has clear value for well-defined optimization tasks but is far from the autonomous research agent that the hype suggests.

In Agreement

The approach successfully found bugs and performed systematic optimizations that the human researcher had missed, demonstrating real practical value
Autoresearch is accessible and simple to set up compared to more sophisticated automated research frameworks, making it broadly applicable to any problem with verifiable metrics
Running optimization loops overnight while researchers sleep has genuine value even if many individual trials are unproductive
Real-world results, such as the 53% speedup reported for Shopify's Liquid templates, validate the approach beyond ML research
The scratchpad/working memory pattern and sandboxed container approach represent good engineering practices for autonomous agent workflows
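One way the scratchpad pattern mentioned above might look in practice is an append-only log that the agent rereads before each new experiment. The sketch below is an assumption about the general pattern, not the article's actual setup: the file name, the JSON Lines format, and the example entries are all hypothetical placeholders.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def append_note(path: Path, entry: Dict[str, Any]) -> None:
    """Append one experiment note as a single JSON line."""
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")


def load_notes(path: Path) -> List[Dict[str, Any]]:
    """Read back every note so past attempts are not repeated."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Demo in a temporary directory; the entries are placeholders, not real results.
with tempfile.TemporaryDirectory() as tmp:
    scratchpad = Path(tmp) / "scratchpad.jsonl"  # hypothetical file name
    append_note(scratchpad, {"experiment": 1, "change": "placeholder A", "outcome": "worse"})
    append_note(scratchpad, {"experiment": 2, "change": "placeholder B", "outcome": "better"})
    notes = load_notes(scratchpad)
```

An append-only file survives agent restarts and context resets, which is the main reason to keep working memory outside the model's context window.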

Opposed

Dedicated hyperparameter optimization tools like Optuna use statistically principled Bayesian methods that are more sample-efficient, and likely to find better configurations, than an LLM manually tweaking parameters one at a time
The agent's inability to make novel architectural changes reveals that autoresearch is optimization, not research: it cannot question hypotheses or metrics, only optimize within a fixed box
Running many experiments on cloud infrastructure is prohibitively expensive for non-VC-backed companies, making the approach inaccessible to many practitioners
Speed of development is a poor metric for productivity: letting agents produce inscrutable code creates maintenance and security risks, as evidenced by vibe-coded production bugs such as exposed API keys
The extensive setup required (AGENTS.md, sandboxing, permission constraints) suggests LLMs need extreme guidance to reach useful goals, undermining claims of autonomous capability