Make AI Work in Big Repos: Spec-First Workflow and Frequent Intentional Compaction

AI coding tools can succeed in large, complex codebases today by structuring the entire development process around context control and spec-first artifacts. The author’s “Frequent Intentional Compaction” workflow (research → plan → implement with subagents and aggressive summarization) enables rapid, high-quality changes while keeping teams mentally aligned. It isn’t magic—human review and expertise remain vital—and the hardest part is organizational change, not model capability.
Key Points
- AI can work in large, complex codebases today if you redesign the development process around context engineering, not just prompts.
- Frequent Intentional Compaction (research → plan → implement) keeps context small, correct, and on-trajectory; use subagents to search/summarize without polluting the main context (a minimal sketch follows this list).
- Focus human review on the highest-leverage artifacts—research and plans—to prevent cascades of bad code and maintain mental alignment.
- Real-world results: rapid bug fixes and 35k LOC of features shipped to a 300k LOC Rust repo in hours, with approvals and working demos.
- It’s not magic: engagement and expertise still matter, and the approach can fail if research is shallow; the biggest challenge is organizational and workflow change, not model capability.
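
The loop below is a minimal sketch of the compaction workflow described above, under stated assumptions: `call_model` is a hypothetical stand-in for whatever LLM client you use, and the phase names, word budgets, and prompts are illustrative, not the author's actual tooling.

```python
def call_model(system: str, prompt: str) -> str:
    """Hypothetical LLM wrapper; swap in your provider's client."""
    raise NotImplementedError

def subagent_research(question: str, repo_root: str) -> str:
    # A subagent searches the repo inside its own context window and
    # returns only a compact summary, so raw file contents never enter
    # the main agent's context.
    return call_model(
        system="You are a read-only research agent. Cite file paths.",
        prompt=f"Investigate: {question}\nRepo: {repo_root}\n"
               "Return at most ~500 words of findings.",
    )

def compact(artifact: str, max_words: int = 300) -> str:
    # Intentional compaction: summarize an artifact before it re-enters
    # the main context, keeping the trajectory small and on course.
    return call_model(
        system="Summarize faithfully; keep file paths and decisions.",
        prompt=f"Compress to <= {max_words} words:\n{artifact}",
    )

def run_change(task: str, repo_root: str) -> str:
    # Phase 1: research, delegated to a subagent and then compacted.
    research = compact(subagent_research(task, repo_root))
    # Phase 2: plan. This is the artifact humans should review hardest.
    plan = call_model(
        system="Write a step-by-step implementation plan with file paths.",
        prompt=f"Task: {task}\nResearch summary:\n{research}",
    )
    # Human review/approval of `plan` belongs here, before any code.
    # Phase 3: implement from the compacted plan, not the raw history.
    return call_model(
        system="Implement the plan as a unified diff.",
        prompt=compact(plan, max_words=600),
    )
```

The design point is that each phase hands the next a deliberately small, reviewed artifact instead of an ever-growing transcript.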
Sentiment
The discussion is highly polarized: one significant segment is strongly skeptical of, and outright opposed to, the article's claims and the future of software engineering they imply, while another enthusiastically agrees and shares similarly structured, successful AI-first workflows.
In Agreement
- AI coding tools are effective in large codebases when a structured, spec-first, plan-driven workflow (similar to 'Frequent Intentional Compaction') is adopted, with human review focused on research and plans rather than just code.
- Many users share similar successful workflows involving detailed PRDs, iterative AI code generation, and human verification, finding that AI acts as a powerful tool with engaged human steering.
- The nature of code review is shifting from line-by-line inspection to higher-level verification of specifications and behaviors, enabling higher productivity.
- Using 'ask me clarifying questions' prompts and statically typed languages (like TypeScript or Go) helps improve AI output and catch errors earlier (see the prompt sketch after this list).
- The process, though requiring learning and effort, ultimately leads to significant productivity gains and can smooth over the 'death by a million paper cuts' in complex projects.
- The problems with 'vibecoding' are overcome by rigorous planning and review, an approach that is more about 'abstraction' and 'hyperengineering' than delegation.
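
A minimal sketch of the 'ask me clarifying questions' pattern mentioned in the list above. `call_model` is the same hypothetical LLM wrapper as in the earlier sketch; the prompts are illustrative, not any commenter's exact wording.

```python
def call_model(system: str, prompt: str) -> str:
    """Hypothetical LLM wrapper; swap in your provider's client."""
    raise NotImplementedError

def clarify_then_generate(task: str) -> str:
    # Step 1: the model asks its questions instead of guessing.
    questions = call_model(
        system="Before writing any code, ask the clarifying questions "
               "you need. Output only the questions, one per line.",
        prompt=task,
    )
    # Step 2: a human answers interactively; nothing is generated yet.
    answers = "\n".join(
        f"Q: {q}\nA: {input(q + ' ')}"
        for q in questions.splitlines() if q.strip()
    )
    # Step 3: only now does the model write code, ambiguity resolved.
    return call_model(
        system="Write code that satisfies the task and the clarifications.",
        prompt=f"Task: {task}\n\nClarifications:\n{answers}",
    )
```

The gating step is the point: generation is blocked until the human has resolved the ambiguities the model itself identified.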
Opposed
- There is a lack of concrete, demoable products to substantiate claims of AI's effectiveness in building large, complex systems, with many AI projects reportedly 'falling apart in not-always-obvious ways' or requiring extensive manual fixing.
- AI struggles with the final, crucial 10-20% of complex tasks, and its generated UIs, low-level code, or concurrent logic can be 'weird,' 'garbage,' or fundamentally flawed.
- The cost of advanced AI models and subscriptions is high ($12k/month for one team cited in the thread) and may not be justified by the actual productivity gains, which some argue merely match a skilled human engineer's output or even make users less effective.
- The idea of not reviewing every line of AI-generated code and the acceptance of huge PRs (20k-35k LOC) is seen as 'hostile,' 'disrespectful,' 'unreviewable,' and a 'joke' that degrades the software engineering profession.
- Managing AI's context and inputs can be as much or more work than writing the code manually, leading to a loss of intrinsic motivation for engineers who prefer to solve problems directly.
- AI-generated tests are often poor quality, can encode incorrect assumptions, add unnecessary runtime, and do not provide the same ergonomic feedback as human-written tests.
- AI exhibits bias when verifying its own code, necessitating separate 'red team' agents for review (sketched below), and often fails or 'stops working' when encountering concepts outside its training data.
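
A minimal sketch of the separate 'red team' reviewer some commenters describe: a second model instance with an adversarial system prompt judges a diff it did not write. `call_model` is the same hypothetical wrapper as above, and the BLOCKING convention is an assumption for illustration.

```python
def call_model(system: str, prompt: str) -> str:
    """Hypothetical LLM wrapper; swap in your provider's client."""
    raise NotImplementedError

def red_team_review(diff: str, spec: str) -> str:
    # A fresh model instance with an adversarial prompt, so it carries
    # no "authorship bias" toward the diff it is judging.
    return call_model(
        system="You are an adversarial code reviewer. You did NOT write "
               "this diff. Hunt for spec violations, flawed concurrency, "
               "and tests that encode wrong assumptions. Prefix any "
               "merge-stopping finding with the word BLOCKING.",
        prompt=f"Spec:\n{spec}\n\nDiff under review:\n{diff}",
    )

def gated_merge(diff: str, spec: str) -> bool:
    # Merge only when the independent reviewer raises no blocking issue.
    # The substring check stands in for real structured-output parsing.
    return "BLOCKING" not in red_team_review(diff, spec)
```

Separating author and reviewer roles is the mitigation being proposed; it does not address the training-data limitation also raised above.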