AI Gatekeeper Slashes E2E CI Time by 84%

The author replaces brittle glob-based test selection with Claude Code, which uses tool calls to inspect only what’s needed and pick the relevant E2E tests for each PR. The system feeds Claude a cleaned git diff, a programmatically generated test inventory, and a carefully worded prompt, then has it write a recommendations file under strict permissions. The author reports that E2E runtime fell from 44 minutes to about 7 with no relevant tests missed and at a reasonable cost.
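The three inputs described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function names, the `*.spec.ts` pattern, and the prompt wording are assumptions made for the example, and only the output filename `test-recommendations.json` comes from the summary.

```python
from pathlib import Path


def list_e2e_tests(root: str, pattern: str = "*.spec.ts") -> list[str]:
    """Build the test inventory by scanning the repo (pattern is hypothetical)."""
    base = Path(root)
    return sorted(p.relative_to(base).as_posix() for p in base.rglob(pattern))


def build_prompt(
    diff_text: str,
    test_files: list[str],
    output_path: str = "test-recommendations.json",
) -> str:
    """Combine the rules, the inventory, and the PR diff into one prompt string."""
    inventory = "\n".join(f"- {name}" for name in sorted(test_files))
    return (
        "Think deep about which E2E tests this change could affect.\n"
        "Choose only from the test files listed below.\n"
        "If you are unsure whether a test is relevant, include it.\n"
        f"Write your selection to {output_path}.\n\n"
        f"Available E2E tests:\n{inventory}\n\n"
        f"PR diff:\n{diff_text}\n"
    )
```

Deriving the inventory from the file system rather than hard-coding it means the model can only be offered tests that actually exist in the checkout.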

Key Points

Glob-based test selection is fragile and noisy; a smarter approach is needed to balance coverage and precision.
Claude Code’s tool calls enable targeted file inspection and dependency tracing instead of loading the entire repo into the model.
The system combines a clean PR diff, a programmatically derived E2E test list, and a strict prompt that tells the model to “think deep,” select only from the allowed files, and include a test when unsure.
Rather than relying on brittle JSON mode, Claude writes a test-recommendations.json file via Edit/Write tools; security is preserved by avoiding overly permissive flags.
Results show an 84% reduction in runtime (from 44 minutes to about 7), no missed relevant tests, acceptable over-selection, and manageable cost (~$30 per contributor per month).
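Because the model writes a file instead of returning structured output, the CI job still has to read that file defensively. The sketch below shows one way this could look. The schema (`{"tests": [...]}`) and the function name are assumptions for illustration, since the summary does not describe the file's format. The fallback mirrors the "include when unsure" rule: a missing or malformed file means running everything.

```python
import json
from pathlib import Path


def load_recommendations(path: str, inventory: list[str]) -> list[str]:
    """Return the tests to run, falling back to the full suite on any problem."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
        selected = data["tests"]  # hypothetical key, not from the source
    except (OSError, ValueError, KeyError, TypeError):
        # Missing file, invalid JSON, or unexpected shape: fail open.
        return sorted(inventory)
    if not isinstance(selected, list):
        return sorted(inventory)
    known = set(inventory)
    # Drop anything that is not in the inventory, e.g. a made-up file name.
    return sorted({t for t in selected if isinstance(t, str) and t in known})
```

An empty list is treated as a valid answer here (for example, a documentation-only change), which is a design choice a stricter pipeline might reverse.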

Sentiment

Mixed but leaning skeptical: some see a pragmatic pre-merge optimization with guardrails, while many criticize it as risky coverage reduction and prefer deterministic or test-pruning alternatives.

In Agreement

Using an LLM as a language-agnostic substitute for static analysis can approximate which tests are relevant across diverse stacks.
It’s reasonable to run fewer E2E tests on PRs if the full suite is executed post-merge or before deploy to catch omissions.
Maintain always-on smoke tests while trimming narrow, flow-specific E2Es to speed up CI without losing core coverage.
The technique balances precision/recall to execute likely-to-fail tests, saving developer time and CI/device-farm costs.
The approach could generalize to other languages and frameworks with minimal changes.
Probabilistic selection is acceptable when the failure rate is sufficiently low and guardrails (post-merge runs) exist.
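The guardrails that supporters describe (always-on smoke tests on PRs, the full suite after merge) amount to a small piece of deterministic logic around the model's selection. A minimal sketch, assuming hypothetical stage names `"pr"` and `"post-merge"` and a hand-maintained smoke list:

```python
def plan_e2e_run(
    stage: str,
    inventory: list[str],
    smoke: list[str],
    selected: list[str],
) -> list[str]:
    """Decide which E2E tests to run for a given pipeline stage."""
    if stage == "post-merge":
        # Guardrail: the full suite runs regardless of what was selected.
        return sorted(inventory)
    if stage == "pr":
        # Smoke tests always run; the model's selection is added on top.
        return sorted(set(smoke) | set(selected))
    raise ValueError(f"unknown stage: {stage!r}")
```

With this split, a test the model wrongly skips on a PR is still executed before deploy, so a miss delays detection rather than hiding the regression.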

Opposed

This doesn’t reduce test time so much as it reduces coverage, introducing a non-zero chance of missing regressions.
Probabilistic test selection is unacceptable; engineers want deterministic, reproducible CI behavior.
A better fix is to prune and simplify the E2E suite and improve CI practices rather than add an LLM layer.
E2E tests are meant to catch unrelated breakages; selectively running them undermines that purpose.
At scale, even rare misses are costly; relying on an LLM increases organizational risk.
Established solutions (dependency graphs, runtime tracing, affected-test tooling) already exist and avoid LLM brittleness.
Prompt hacks like “think deep” feel unreliable and fragile; model quality variance further erodes trust.
The title and claims seem inflated; framing speedups as a universal win glosses over risk and coverage trade-offs.
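For comparison, the deterministic alternative that critics point to can be sketched as a walk over a reverse dependency graph. This is an illustrative toy, not any specific tool's algorithm: it assumes a static import map (`deps`) is already available, which may not hold for E2E tests that exercise the application through a browser instead of importing its code.

```python
from collections import deque


def affected_tests(
    changed: set[str],
    deps: dict[str, set[str]],
    tests: set[str],
) -> list[str]:
    """Return tests that transitively depend on any changed file.

    deps maps each file to the set of files it imports.
    """
    # Invert the graph: for each file, who imports it?
    reverse: dict[str, set[str]] = {}
    for src, targets in deps.items():
        for target in targets:
            reverse.setdefault(target, set()).add(src)

    # Breadth-first search outward from the changed files.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependant in reverse.get(current, ()):
            if dependant not in seen:
                seen.add(dependant)
                queue.append(dependant)
    return sorted(seen & tests)
```

The same inputs always produce the same selection, which is the reproducibility property that the LLM-based approach gives up.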