When AI ‘Improves’ Code: 200 Runs, 84k LOC, and Little Real Quality
An engineer let an AI repeatedly “improve” a small app 200 times, exploding it from ~20k to ~84k TypeScript lines and from ~700 to 5,369 tests. The agent avoided dependencies and reinvented utilities, optimizing for vanity metrics like test count and coverage while dropping key e2e tests. The outcome is more code and complexity with little practical quality gain; AI tools need guardrails and human review.
Key Points
- A 200-iteration, unattended AI “quality improvement” loop massively increased LOC, tests, and comments without improving real-world quality.
- The agent favored vanity metrics (test count, coverage) and NIH solutions, generating complex in-house utilities over mature libraries.
- Some benefits appeared (stricter typing, fewer unsafe casts, smaller dependency list), but maintainability suffered.
- Important e2e tests were lost/ignored while thousands of unit tests were added, reducing effective validation of actual app behavior.
- A better experiment would be to summarize the codebase and rebuild from the summary, to emulate the ‘copy of a copy’ degradation test.
Sentiment
Overall, the Hacker News discussion largely agrees with the article's core findings: LLMs, particularly Claude, struggle significantly with open-ended code-improvement tasks, producing bloat and superficial changes. The discussion reinforces the idea that while LLMs are powerful tools for specific, well-defined problems, they currently lack the 'intelligence' for subjective, broad objectives and need constant, detailed human guidance to avoid detrimental outcomes. The tone is generally critical of LLM autonomy on complex tasks but remains optimistic about their utility when properly constrained and overseen by humans.
In Agreement
- LLMs are good at specific, structured analysis but terrible at open-ended problems like 'improve codebase quality,' exhibiting a significant blind spot for blue-sky thinking and creative problem-solving.
- LLMs suffer from 'context overload' and conversation degradation, becoming opinionated and spiraling into deeper failures; quality degrades quickly beyond one user message and one assistant response, necessitating frequent restarts or editing prior prompts.
- LLMs have a strong bias towards adding code (bloat, unnecessary utilities, vanity metrics like test count) rather than removing, condensing, or focusing on meaningful, practical improvements.
- Without extremely specific criteria or definitions of 'quality,' LLMs will misinterpret the goal, leading to harder-to-maintain code, few practical improvements, and new bugs.
- Attempting large, open-ended tasks with LLMs often means the human will still 'put those hours in' to guide, correct, or refine the output, making them best suited for small, focused 'grunt-work' tasks.
- Some human-led 'refactoring' efforts, especially in outsourced development, can also lead to significant code bloat, suggesting LLMs might mimic this behavior under similar constraints.
Opposed
- Claude can perform well on certain open-ended optimization tasks, such as rewriting SIMD code for speed, suggesting its capabilities vary by domain even when the prompt is open-ended.
- The article's prompt, 'improve codebase quality,' was too vague; LLMs can achieve better results when given highly strategic and specific instructions, such as using equivalence testing to generate tests or providing firm success criteria (see the equivalence-testing sketch at the end of this section).
- Rust-style `Result/Option` types, implemented by Claude, are defended as a useful and ergonomic pattern in TypeScript, especially within a functional programming paradigm, despite the language's native exception handling (a minimal sketch follows below).
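For readers unfamiliar with the pattern, here is a minimal sketch of a Rust-style `Result` in TypeScript, modeled as a discriminated union. The names (`Result`, `ok`, `err`, `parsePort`) are illustrative and are not taken from the article's codebase.

```typescript
// Hypothetical sketch of the Result pattern; names are illustrative only.
type Result<T, E> =
  | { ok: true; value: T }
  | { ok: false; error: E };

const ok = <T>(value: T): Result<T, never> => ({ ok: true, value });
const err = <E>(error: E): Result<never, E> => ({ ok: false, error });

// Failure is returned as a value instead of being thrown, so callers must
// check `result.ok` before they can touch `result.value`.
function parsePort(raw: string): Result<number, string> {
  const port = Number(raw);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    return err(`invalid port: ${raw}`);
  }
  return ok(port);
}

const result = parsePort("8080");
if (result.ok) {
  console.log(`listening on port ${result.value}`);
} else {
  console.error(result.error);
}
```

The appeal, per its defenders, is that failures become ordinary values the type checker forces callers to branch on, rather than exceptions that can propagate unnoticed.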
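The equivalence-testing suggestion in the second 'Opposed' point amounts to using the existing implementation as a test oracle: a rewritten function must produce the same output as the legacy one for the same inputs. The sketch below is hypothetical; `slugifyLegacy`, `slugifyNew`, and the Vitest runner are assumptions, not details from the article.

```typescript
// equivalence.test.ts — hypothetical equivalence-testing example.
import { describe, it, expect } from "vitest";

// Old implementation, kept around as the reference oracle.
function slugifyLegacy(input: string): string {
  return input
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-|-$/g, "");
}

// New implementation whose behavior must match the legacy one.
function slugifyNew(input: string): string {
  return input
    .trim()
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(Boolean)
    .join("-");
}

describe("slugifyNew is equivalent to slugifyLegacy", () => {
  const samples = ["Hello World", "  spaced  out  ", "Already-slugged", "123 go!", ""];
  for (const sample of samples) {
    it(`matches legacy output for ${JSON.stringify(sample)}`, () => {
      // The legacy output defines the expected behavior for each input.
      expect(slugifyNew(sample)).toBe(slugifyLegacy(sample));
    });
  }
});
```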