AI Gatekeeper Slashes E2E CI Time by 84%

Added Sep 6, 2025

The author replaces brittle glob-based test selection with Claude Code, which uses tool calls to inspect only what’s needed and pick relevant E2E tests for each PR. The system integrates a cleaned git diff, a programmatic test inventory, and a careful prompt, then has Claude write a recommendations file under strict permissions. This reduced E2E runtime from 44 to ~7 minutes while preserving coverage and keeping costs reasonable.

Key Points

  • Glob-based test selection is fragile and noisy; a smarter approach is needed to balance coverage and precision.
  • Claude Code’s tool calls enable targeted file inspection and dependency tracing instead of loading the entire repo into the model.
  • The system combines a clean PR diff, a programmatically derived E2E test list, and a strict prompt (“think deep,” only allowed files, include when unsure).
  • Rather than relying on brittle JSON mode, Claude writes a test-recommendations.json file via Edit/Write tools; security is preserved by avoiding overly permissive flags.
  • Results show an 84% reduction in runtime (44 to ~7 minutes), no missed relevant tests, acceptable over-selection, and manageable cost (~$30/contributor/month).
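The pipeline described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the diff-cleaning rules, prompt wording, and the `test-recommendations.json` path and schema are illustrative, not the author's exact implementation.

```python
"""Sketch of an LLM-driven E2E test selector, per the summary above.

The real system feeds Claude Code a cleaned PR diff plus a programmatic
test inventory, then lets it write a recommendations file via Edit/Write
tools under restricted permissions. All names below are assumptions.
"""
import json
import subprocess

RECOMMENDATIONS_FILE = "test-recommendations.json"  # hypothetical output path


def cleaned_diff(base: str = "origin/main") -> str:
    """PR diff with noise (lockfiles, build output) excluded (assumed rules)."""
    return subprocess.run(
        ["git", "diff", f"{base}...HEAD", "--", ".", ":!*.lock", ":!dist/*"],
        capture_output=True, text=True, check=True,
    ).stdout


def build_prompt(diff: str, test_inventory: list[str]) -> str:
    """Combine the diff, the test inventory, and the strict instructions."""
    return (
        "Think deep about which E2E tests cover the changed behavior.\n"
        "Only select from the allowed test files listed below; "
        "when unsure, include the test.\n"
        f"Write your selection to {RECOMMENDATIONS_FILE} as JSON "
        '{"tests": [...]}.\n\n'
        "## Available E2E tests\n" + "\n".join(test_inventory)
        + "\n\n## PR diff\n" + diff
    )


def parse_recommendations(raw: str, inventory: list[str]) -> list[str]:
    """Validate the model's output: keep only tests that actually exist."""
    selected = json.loads(raw).get("tests", [])
    return [t for t in selected if t in inventory]
```

In CI, the prompt would be handed to the Claude Code CLI with only Edit/Write tools permitted, after which the runner reads the recommendations file and executes just those tests, falling back to the full suite if the file is missing or fails validation.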

Sentiment

Mixed but leaning skeptical: some see a pragmatic pre-merge optimization with guardrails, while many criticize it as risky coverage reduction and prefer deterministic or test-pruning alternatives.

In Agreement

  • Using an LLM as a language-agnostic substitute for static analysis can approximate which tests are relevant across diverse stacks.
  • It’s reasonable to run fewer E2E tests on PRs if the full suite is executed post-merge or before deploy to catch omissions.
  • Maintain always-on smoke tests while trimming narrow, flow-specific E2Es to speed up CI without losing core coverage.
  • The technique balances precision and recall so that the tests most likely to fail are the ones executed, saving developer time and CI/device-farm costs.
  • The approach could generalize to other languages and frameworks with minimal changes.
  • Probabilistic selection is acceptable when the failure rate is sufficiently low and guardrails (post-merge runs) exist.
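The last point is an expected-value argument, which can be made concrete with a back-of-the-envelope calculation. Every number below is an illustrative assumption (only the 44 → ~7 minute speedup comes from the article):

```python
# Back-of-the-envelope check of the "probabilistic selection is fine with
# guardrails" argument. All inputs are illustrative assumptions.

def net_minutes_saved(miss_rate: float, prs_per_day: float,
                      minutes_saved_per_pr: float,
                      postmerge_fix_minutes: float) -> float:
    """Net engineer-minutes saved per day by selective E2E runs.

    miss_rate: probability a PR's relevant failing test is skipped pre-merge
               (and caught later by the post-merge full-suite guardrail).
    postmerge_fix_minutes: extra cost when a break is found post-merge.
    """
    saved = prs_per_day * minutes_saved_per_pr
    lost = prs_per_day * miss_rate * postmerge_fix_minutes
    return saved - lost

# With the article's 44 -> ~7 min speedup (~37 min saved per PR), even a
# 2% miss rate with a 60-minute post-merge cleanup nets out well positive:
net = net_minutes_saved(miss_rate=0.02, prs_per_day=20,
                        minutes_saved_per_pr=37, postmerge_fix_minutes=60)
# net == 716.0 engineer-minutes/day under these assumed numbers
```

The same arithmetic also shows where the skeptics' objection bites: if a missed regression can reach production (no post-merge guardrail) and costs hours rather than minutes, the sign of the result flips quickly.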

Opposed

  • This doesn’t reduce test time so much as it reduces coverage, introducing a non-zero chance of missing regressions.
  • Probabilistic test selection is unacceptable; engineers want deterministic, reproducible CI behavior.
  • A better fix is to prune and simplify the E2E suite and improve CI practices rather than add an LLM layer.
  • E2E tests are meant to catch unrelated breakages; selectively running them undermines that purpose.
  • At scale, even rare misses are costly; relying on an LLM increases organizational risk.
  • Established solutions (dependency graphs, runtime tracing, affected-tests) exist and avoid LLM brittleness.
  • Prompt hacks like “think deep” feel unreliable and fragile; model quality variance further erodes trust.
  • Title/claims seem inflated; framing speedups as a universal win glosses over risk and coverage trade-offs.
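The deterministic "affected tests" alternative the critics favor can be sketched as a maintained mapping from source modules to the E2E tests that exercise them. The map, paths, and `"ALL"` fallback marker below are hypothetical:

```python
# Minimal sketch of deterministic affected-test selection: a maintained
# (or build-system-derived) map from source directories to E2E tests.
# All paths and the mapping itself are illustrative assumptions.

DEPENDENCY_MAP = {
    "src/checkout/": ["e2e/checkout.spec.ts", "e2e/payment.spec.ts"],
    "src/auth/": ["e2e/login.spec.ts"],
}
ALWAYS_RUN = ["e2e/smoke.spec.ts"]  # always-on smoke tests


def affected_tests(changed_files: list[str]) -> list[str]:
    """Deterministically map a PR's changed files to E2E tests.

    Files with no mapping trigger the full suite, mirroring the article's
    "include when unsure" rule but without an LLM in the loop: the same
    diff always yields the same selection.
    """
    selected = set(ALWAYS_RUN)
    for path in changed_files:
        hits = [tests for prefix, tests in DEPENDENCY_MAP.items()
                if path.startswith(prefix)]
        if not hits:
            return ["ALL"]  # unknown dependency: run everything
        for tests in hits:
            selected.update(tests)
    return sorted(selected)
```

The trade-off is exactly the one debated above: this is reproducible and auditable, but the map must be kept current by hand or derived from a build graph, which is the maintenance burden the LLM approach tries to sidestep.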