AI Gatekeeper Slashes E2E CI Time by 84%
Added September 6, 2025

The author replaces brittle glob-based test selection with Claude Code, which uses tool calls to inspect only what’s needed and pick relevant E2E tests for each PR. The system integrates a cleaned git diff, a programmatic test inventory, and a careful prompt, then has Claude write a recommendations file under strict permissions. This reduced E2E runtime from 44 to ~7 minutes while preserving coverage and keeping costs reasonable.
Key Points
- Glob-based test selection is fragile and noisy; a smarter approach is needed to balance coverage and precision.
- Claude Code’s tool calls enable targeted file inspection and dependency tracing instead of loading the entire repo into the model.
- The system combines a clean PR diff, a programmatically derived E2E test list, and a strict prompt (“think deep,” only allowed files, include when unsure).
- Rather than relying on brittle JSON mode, Claude writes a test-recommendations.json file via Edit/Write tools; security is preserved by avoiding overly permissive flags.
- Results show an 84% reduction in runtime (44 to ~7 minutes), no missed relevant tests, acceptable over-selection, and manageable cost (~$30/contributor/month).
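The pieces listed above can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the prompt wording, the function names `build_prompt` and `load_recommendations`, and the `"tests"` key inside test-recommendations.json are all assumptions made for the example. The one behavior it tries to capture is the "include when unsure" rule, which is interpreted here as falling back to the full suite whenever the recommendations file is missing or malformed.

```python
import json
from pathlib import Path


def build_prompt(diff_text: str, test_files: list[str]) -> str:
    """Combine a cleaned PR diff with the E2E test inventory (hypothetical wording)."""
    inventory = "\n".join(f"- {name}" for name in sorted(test_files))
    return (
        "Think deep about which E2E tests this PR could affect.\n"
        "Choose only from the files listed below; include a test when unsure.\n"
        "Write your answer to test-recommendations.json.\n\n"
        f"E2E tests:\n{inventory}\n\n"
        f"Diff:\n{diff_text}\n"
    )


def load_recommendations(path: Path, test_files: list[str]) -> list[str]:
    """Read the model's file; run everything if it cannot be trusted."""
    try:
        data = json.loads(path.read_text())
        picked = data["tests"]  # assumed schema: {"tests": ["path", ...]}
    except (OSError, ValueError, KeyError, TypeError):
        # Missing file, invalid JSON, or unexpected shape: fail safe.
        return sorted(test_files)
    if not isinstance(picked, list):
        return sorted(test_files)
    known = set(test_files)
    # Drop names that are not in the inventory (for example, invented paths).
    return sorted(t for t in picked if isinstance(t, str) and t in known)
```

Validating the model's picks against the programmatic inventory means an invented file name can never reach the test runner, and any parsing problem costs time rather than coverage.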
Sentiment
Mixed but leaning skeptical: some see a pragmatic pre-merge optimization with guardrails, while many criticize it as risky coverage reduction and prefer deterministic or test-pruning alternatives.
In Agreement
- Using an LLM as a language-agnostic substitute for static analysis can approximate which tests are relevant across diverse stacks.
- It’s reasonable to run fewer E2E tests on PRs if the full suite is executed post-merge or before deploy to catch omissions.
- Maintain always-on smoke tests while trimming narrow, flow-specific E2Es to speed up CI without losing core coverage.
- The technique balances precision/recall to execute likely-to-fail tests, saving developer time and CI/device-farm costs.
- The approach could generalize to other languages and frameworks with minimal changes.
- Probabilistic selection is acceptable when the failure rate is sufficiently low and guardrails (post-merge runs) exist.
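The guardrails these commenters describe (always-on smoke tests, plus a full run after merge) could be combined with any selection step along these lines. This is a hedged sketch: the function name `plan_e2e_run` and the stage label `"post-merge"` are hypothetical and do not come from the article.

```python
def plan_e2e_run(
    all_tests: list[str],
    smoke_tests: list[str],
    selected: list[str],
    stage: str,
) -> list[str]:
    """Decide which E2E tests to run for a given CI stage."""
    if stage == "post-merge":
        # The full suite runs after merge, so anything the selection
        # step skipped on the PR is still exercised before deploy.
        return sorted(set(all_tests))
    # On PRs: smoke tests always run, plus whatever was selected.
    # Intersecting with the inventory discards unknown test names.
    return sorted((set(smoke_tests) | set(selected)) & set(all_tests))
```

With this split, a missed selection on a PR delays detection until the post-merge run rather than letting the regression ship unnoticed.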
Opposed
- This doesn’t reduce test time so much as it reduces coverage, introducing a non-zero chance of missing regressions.
- Probabilistic test selection is unacceptable; engineers want deterministic, reproducible CI behavior.
- A better fix is to prune and simplify the E2E suite and improve CI practices rather than add an LLM layer.
- E2E tests are meant to catch unrelated breakages; selectively running them undermines that purpose.
- At scale, even rare misses are costly; relying on an LLM increases organizational risk.
- Established solutions (dependency graphs, runtime tracing, affected-tests) exist and avoid LLM brittleness.
- Prompt hacks like “think deep” feel unreliable and fragile; model quality variance further erodes trust.
- Title/claims seem inflated; framing speedups as a universal win glosses over risk and coverage trade-offs.
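For comparison, the deterministic "affected tests" approach that critics point to can be sketched as a walk over a reverse dependency graph. This is an illustrative toy, not any particular tool's algorithm: the function name `affected_tests` and the shape of the `deps` mapping (file to the set of files it imports) are assumptions, and real tools build that graph from the build system or from runtime tracing.

```python
from collections import deque


def affected_tests(
    deps: dict[str, set[str]],
    changed: set[str],
    tests: set[str],
) -> list[str]:
    """Return tests that transitively depend on any changed file."""
    # Invert the graph: for each file, who imports it?
    reverse: dict[str, set[str]] = {}
    for source, targets in deps.items():
        for target in targets:
            reverse.setdefault(target, set()).add(source)

    # Breadth-first walk from the changed files to everything upstream.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for dependent in reverse.get(node, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)

    # A changed test file counts as affected too.
    return sorted(seen & tests)
```

The same inputs always produce the same selection, which is the reproducibility the critics ask for. The trade-off is that the result is only as complete as the dependency graph it is given.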