LLM-as-a-Courtroom: Evidence-Backed Doc Updates from Code Changes

Added Jan 27
Article sentiment: Positive · Community sentiment: Neutral/Divisive

Falconer replaces unreliable LLM scoring with an LLM-as-a-Courtroom system to decide when code changes warrant documentation updates. Prosecutor, Defense, Jury, and Judge roles enforce evidence-backed arguments, adversarial testing, and a consistent final verdict with concise edit proposals. In production, the strict, precision-first system filters most noise and aligns well with human reviewers, while ongoing work addresses jury bias, testing, and edge cases.

Key Points

  • Numeric scoring by LLMs was unreliable; reframing decisions as structured arguments leverages LLM strengths in explanation and reasoning.
  • A courtroom architecture (Prosecutor, Defense, Jury, Judge) enforces evidence, adversarial debate, independent deliberation, and a consistent final ruling.
  • Prosecutor exhibits must include an exact PR quote, an exact document quote, and a concrete harm statement to ground claims in verifiable context.
  • The system is tuned for high precision to sustain trust, aggressively filtering cases and proposing few, specific edits when necessary.
  • Production results show strong filtering and high human-alignment rates, though challenges remain (jury bias, observability needs, real-world edge cases).
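The evidence requirement above (exact PR quote, exact document quote, concrete harm statement) can be sketched as a simple validation step; the field and function names here are assumptions for illustration, not the authors' actual schema:

```python
from dataclasses import dataclass

@dataclass
class Exhibit:
    """Hypothetical record a Prosecutor files (field names assumed)."""
    pr_quote: str        # text quoted verbatim from the pull request diff
    doc_quote: str       # text quoted verbatim from the existing documentation
    harm_statement: str  # concrete description of reader harm if docs go stale

def validate_exhibit(exhibit: Exhibit, pr_text: str, doc_text: str) -> bool:
    """Reject exhibits whose quotes are not verbatim substrings of their sources,
    grounding every claim in verifiable context."""
    return (
        exhibit.pr_quote in pr_text
        and exhibit.doc_quote in doc_text
        and len(exhibit.harm_statement.strip()) > 0
    )
```

Forcing verbatim quotes gives downstream roles (Defense, Jury, Judge) something checkable to argue over, rather than a free-floating score.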

Sentiment

Moderately mixed. Technical contributors engage constructively with the approach, and the authors respond substantively to criticism. However, there is notable skepticism about whether the adversarial complexity is justified without baseline comparisons. Cost concerns are effectively addressed by the funnel architecture explanation, but the missing baseline comparison remains an unanswered criticism. A philosophical tangent about LLM understanding is heated but mostly separate from the core technical debate.

In Agreement

  • LLMs are better at structured argumentation than numerical scoring, making the courtroom metaphor a natural fit for decision-making
  • The adversarial multi-agent structure provides checks that a single prompt cannot achieve, with each role contributing independent reasoning from different contexts
  • The funnel architecture makes the system cost-effective by filtering most PRs before the expensive courtroom stage, so only 1-2% trigger full adversarial review
  • Functional correctness of input-output mapping matters more than whether the LLM truly understands — systems don't need to be perfect to be useful
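The funnel argument above can be sketched as a two-stage pipeline in which cheap heuristics discard most PRs before any expensive multi-agent call; the filter rules and function names here are illustrative assumptions, as the article's actual filters are not described:

```python
def cheap_filter(pr: dict) -> bool:
    """Stage 1: inexpensive heuristic check; True means the PR might affect docs.
    (Illustrative rule only — the production system's filters are unspecified.)"""
    doc_sensitive = {"api", "config", "cli", "schema"}
    return any(
        keyword in path.lower()
        for path in pr.get("paths", [])
        for keyword in doc_sensitive
    )

def run_pipeline(prs, courtroom):
    """Stage 2: only PRs surviving the filter reach the costly courtroom stage,
    so the four-agent adversarial review runs on a small fraction of traffic."""
    return [courtroom(pr) for pr in prs if cheap_filter(pr)]
```

Under this shape, the per-PR cost of the Prosecutor/Defense/Jury/Judge roles is paid only for the small slice of changes (reportedly 1-2%) that pass the filter.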

Opposed

  • No comparison to a simpler baseline such as a single reasoning judge makes the reported accuracy figure hard to evaluate
  • Running four distinct agent roles per update could be prohibitively expensive in latency and token spend without clear evidence it outperforms simpler approaches
  • LLMs fundamentally do not understand concepts like user harm and cannot truly reason about consequences, making the courtroom framing misleading
  • Error rates in LLMs compound with more sequential steps, and successful AI demonstrations tend to be cherry-picked and heavily human-curated