Harness design for long-running application development \ Anthropic

Anthropic engineers built a multi-agent harness to improve Claude's performance on long-running, complex tasks such as app development and frontend design. By adding a 'Generator-Evaluator' loop and a dedicated 'Planner' agent, they addressed common failure modes such as context anxiety and weak self-critique. The article argues that while this scaffolding significantly boosts performance, it should be simplified as the underlying models become more natively capable.
Key Points
- Naive agent implementations often fail because of 'context anxiety' and an inability to evaluate their own work objectively.
- A multi-agent 'Generator-Evaluator' loop, where a separate agent critiques output based on specific criteria, significantly improves quality in both subjective design and functional coding.
- Using a 'Planner' agent to expand short prompts into comprehensive product specifications prevents under-scoping and provides a roadmap for the generator.
- Integrating tools like the Playwright MCP allows evaluator agents to perform 'black-box' testing on live applications, catching bugs that static code analysis might miss.
- Harness design must be dynamic; as underlying models improve, engineers should strip away redundant scaffolding to reduce cost and latency while finding new ways to push the frontier.
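
The Planner, Generator, and Evaluator roles described in the key points can be sketched as a simple control loop. This is a minimal illustration, not the harness from the article: every name here (`run_harness`, `Review`, the `stub_*` functions) is hypothetical, and the stubs stand in for what would be model calls and, for the evaluator, live testing of the running application.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Review:
    """Evaluator verdict: pass/fail plus the spec items still unmet."""
    passed: bool
    missing: List[str] = field(default_factory=list)


# Hypothetical agent signatures; a real harness would wrap model calls here.
Planner = Callable[[str], List[str]]
Generator = Callable[[List[str], List[str], Optional[Review]], List[str]]
Evaluator = Callable[[List[str], List[str]], Review]


def run_harness(
    prompt: str,
    planner: Planner,
    generator: Generator,
    evaluator: Evaluator,
    max_rounds: int = 3,
) -> Tuple[List[str], List[Review]]:
    """Plan once, then alternate generate/evaluate until pass or budget runs out."""
    spec = planner(prompt)          # expand the short prompt into a fuller spec
    artifact: List[str] = []
    review: Optional[Review] = None
    history: List[Review] = []
    for _ in range(max_rounds):
        artifact = generator(spec, artifact, review)  # build or revise
        review = evaluator(spec, artifact)            # separate agent critiques
        history.append(review)
        if review.passed:
            break
    return artifact, history


# --- Stub agents (assumptions for illustration only) ---

def stub_planner(prompt: str) -> List[str]:
    # Pretend the planner expands the prompt into three required features.
    return [f"{prompt}: sign-in", f"{prompt}: dashboard", f"{prompt}: settings"]


def stub_generator(
    spec: List[str], previous: List[str], review: Optional[Review]
) -> List[str]:
    if review is None:
        return spec[:1]                   # under-scoped first draft
    return previous + review.missing[:1]  # address one critique per round


def stub_evaluator(spec: List[str], artifact: List[str]) -> Review:
    # Checks the artifact against the spec rather than trusting the generator.
    missing = [item for item in spec if item not in artifact]
    return Review(passed=not missing, missing=missing)


if __name__ == "__main__":
    artifact, history = run_harness(
        "todo app", stub_planner, stub_generator, stub_evaluator
    )
    print(f"rounds: {len(history)}, passed: {history[-1].passed}")
```

With these stubs the loop needs all three rounds to satisfy the spec, and lowering `max_rounds` to 2 leaves the final review failing. Keeping the evaluator as a separate callable also makes it straightforward to remove or swap out a stage later, in line with the point about stripping scaffolding as models improve.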