AI Gatekeeper Slashes E2E CI Time by 84%

Added Sep 6, 2025
Article: Positive · Community: Negative/Divisive

The author replaces brittle glob-based test selection with Claude Code, which uses tool calls to inspect only what’s needed and pick relevant E2E tests for each PR. The system integrates a cleaned git diff, a programmatic test inventory, and a careful prompt, then has Claude write a recommendations file under strict permissions. This reduced E2E runtime from 44 to ~7 minutes while preserving coverage and keeping costs reasonable.

Key Points

  • Glob-based test selection is fragile and noisy; a smarter approach is needed to balance coverage and precision.
  • Claude Code’s tool calls enable targeted file inspection and dependency tracing instead of loading the entire repo into the model.
  • The system combines a clean PR diff, a programmatically derived E2E test list, and a strict prompt that tells the model to "think deep," touch only the allowed files, and include a test whenever it is unsure.
  • Rather than relying on brittle JSON mode, Claude writes a test-recommendations.json file via Edit/Write tools; security is preserved by avoiding overly permissive flags.
  • Results show an 84% reduction in runtime (44 to ~7 minutes), no missed relevant tests, acceptable over-selection, and manageable cost (~$30/contributor/month).
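The pipeline described above can be sketched as a CI-side script that gathers the two inputs (a cleaned diff and a test inventory) and then reads back the recommendations file the agent writes. This is a minimal illustration, not the author's actual code; the file name `test-recommendations.json`, the `tests` key, the spec-file glob, and the excluded paths are assumptions, and the fallback to the full suite mirrors the "include when unsure" rule.

```python
import json
import subprocess
from pathlib import Path

def cleaned_diff(base: str = "origin/main") -> str:
    """PR diff with noisy paths (lockfiles, snapshots) excluded via git pathspecs."""
    return subprocess.run(
        ["git", "diff", base, "--", ".", ":!*.lock", ":!*.snap"],
        capture_output=True, text=True, check=True,
    ).stdout

def e2e_inventory(root: str = "e2e") -> list[str]:
    """Programmatically derived list of all E2E spec files (glob is assumed)."""
    return sorted(str(p) for p in Path(root).rglob("*.spec.ts"))

def selected_tests(recommendations: str, inventory: list[str]) -> list[str]:
    """Read the agent-written recommendations file, keeping only tests that
    actually exist; fall back to the full suite if the file is missing or
    malformed, so a failed selection never silently skips coverage."""
    try:
        data = json.loads(Path(recommendations).read_text())
        picks = [t for t in data["tests"] if t in inventory]
        return picks or inventory
    except (OSError, json.JSONDecodeError, KeyError, TypeError):
        return inventory
```

The fallback path is the important design choice: when the model's output is unusable, the cost is a slow CI run, never a missed test.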

Sentiment

The discussion is predominantly skeptical. While some commenters defend the approach as pragmatic and consistent with existing industry practices, the majority express concern about reduced coverage guarantees, non-determinism, and the difficulty of validating that the right tests were selected. The author's participation helps ground the discussion, but the community remains unconvinced that an LLM should be trusted with this decision.

In Agreement

  • Probabilistic test selection is already standard practice at large tech companies, and existing heuristics were often poor and arbitrary — an LLM is a reasonable improvement
  • Running a targeted subset pre-merge with a full suite nightly is a pragmatic tradeoff that most teams already make in some form
  • The Claude Code SDK's tool-use approach of iteratively inspecting relevant code is smarter than trying to dump the whole repository into context
  • The author's real-world production experience shows the approach working without missed test failures

Opposed

  • The approach doesn't reduce test time — it reduces test coverage by skipping tests the LLM deems irrelevant, which fundamentally changes the safety guarantee
  • You cannot validate correctness by observing that all selected tests pass — the absence of failures doesn't prove the right tests were chosen
  • Non-deterministic test selection is worse than deterministic heuristics because it can't be audited, reproduced, or reasoned about consistently
  • Meta and other large companies already use similar heuristics and find them painful when they fail — an LLM adds more unpredictability, not less
  • Deterministic alternatives like static dependency analysis, code ownership mappings, and git blame would be more reliable and auditable
  • In distributed systems, changes in one service can cause failures in unrelated downstream services, making AI-based test selection especially risky
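The deterministic alternative commenters favor can be sketched as reverse-dependency test selection: invert the import graph, then walk outward from the changed files to every test that transitively depends on them. This is an illustrative sketch of the general technique, not code from the article or the thread; the graph representation and function names are my own.

```python
from collections import defaultdict

def reverse_deps(imports: dict[str, set[str]]) -> dict[str, set[str]]:
    """Invert a module -> imported-modules map into imported -> importers."""
    rdeps: dict[str, set[str]] = defaultdict(set)
    for mod, deps in imports.items():
        for dep in deps:
            rdeps[dep].add(mod)
    return rdeps

def affected_tests(changed: set[str], imports: dict[str, set[str]],
                   tests: set[str]) -> set[str]:
    """Select every test that transitively imports a changed module."""
    rdeps = reverse_deps(imports)
    seen: set[str] = set()
    stack = list(changed)
    while stack:
        mod = stack.pop()
        if mod in seen:
            continue
        seen.add(mod)
        stack.extend(rdeps.get(mod, ()))
    return tests & seen
```

Unlike an LLM's selection, this walk is reproducible and auditable, which is exactly the property the skeptics want; its blind spot, as the distributed-systems objection notes, is coupling that never appears in the import graph.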