Harness design for long-running application development

Anthropic engineers developed a multi-agent harness to improve Claude's ability to handle long-running, complex tasks like app development and frontend design. By implementing a 'Generator-Evaluator' loop and a dedicated 'Planner' agent, they overcame common issues like 'context anxiety' and poor self-critique. The work shows that advanced scaffolding significantly boosts performance, but that these architectures should be simplified as underlying models become more capable on their own.

Key Points

Naive agent implementations often fail because of 'context anxiety' and an inability to evaluate their own work objectively.
A multi-agent 'Generator-Evaluator' loop, where a separate agent critiques output based on specific criteria, significantly improves quality in both subjective design and functional coding.
Using a 'Planner' agent to expand short prompts into comprehensive product specifications prevents under-scoping and provides a roadmap for the generator.
Integrating tools like the Playwright MCP allows evaluator agents to perform 'black-box' testing on live applications, catching bugs that static code analysis might miss.
Harness design must be dynamic; as underlying models improve, engineers should strip away redundant scaffolding to reduce cost and latency while finding new ways to push the frontier.
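
The planner, generator, and evaluator roles described in the points above can be sketched as a small control loop. This is a minimal Python sketch under stated assumptions, not the harness Anthropic built: the names `run_harness`, `Review`, `threshold`, and `max_rounds` are hypothetical, the numeric score is an assumed way of encoding "meets the criteria", and the three agents are plain callables standing in for model calls. In a real setup the evaluator would drive the live application (for example through the Playwright MCP) rather than inspect a string.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Review:
    """Evaluator verdict (hypothetical shape, not taken from the source)."""
    score: float   # assumed scale: 0.0 (fails criteria) to 1.0 (meets them)
    feedback: str  # critique handed back to the generator


def run_harness(
    prompt: str,
    planner: Callable[[str], str],
    generator: Callable[[str, str], str],
    evaluator: Callable[[str, str], Review],
    threshold: float = 0.8,  # assumed acceptance bar
    max_rounds: int = 3,     # assumed cap to bound cost and latency
) -> Tuple[str, List[Review]]:
    # Planner: expand the short prompt into a fuller product spec.
    spec = planner(prompt)
    feedback = ""
    artifact = ""
    history: List[Review] = []
    for _ in range(max_rounds):
        # Generator: build (or revise) the artifact from spec plus critique.
        artifact = generator(spec, feedback)
        # Evaluator: a separate agent judges the result against the spec.
        review = evaluator(spec, artifact)
        history.append(review)
        if review.score >= threshold:
            break
        feedback = review.feedback
    return artifact, history


# Stub agents for illustration only; real ones would call a model.
def stub_planner(prompt: str) -> str:
    return f"SPEC for '{prompt}': login page, dashboard"


def stub_generator(spec: str, feedback: str) -> str:
    return "app v2 (fixed login)" if feedback else "app v1"


def stub_evaluator(spec: str, artifact: str) -> Review:
    if "v2" in artifact:
        return Review(0.9, "meets criteria")
    return Review(0.4, "login button does nothing")


final_artifact, reviews = run_harness(
    "build a todo app", stub_planner, stub_generator, stub_evaluator
)
```

Keeping the evaluator as a separate callable mirrors the point that self-critique by the generator is unreliable, and the `max_rounds` cap is one simple lever for trading quality against cost when the underlying model needs fewer revision passes.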