AI Failures Drift Toward Incoherence as Tasks and Reasoning Grow
Using a bias-variance lens, the authors find that frontier models become more incoherent as tasks grow harder and reasoning lengthens, with variance dominating failures. Scale improves coherence on easy tasks but not hard ones; natural overthinking increases incoherence, while ensembling can reduce it. A synthetic optimizer experiment reinforces that models learn correct objectives faster than reliable optimization, shifting expected failure modes toward industrial-accident-like unpredictability.
Key Points
- Error decomposition shows that as tasks get harder and reasoning/actions get longer, model failures shift toward variance (incoherence) rather than bias (systematic misalignment).
- Model scale improves coherence on easy tasks but not on hard ones; on challenging problems, larger models often remain equally incoherent or become more so.
- Spontaneous, longer-than-usual reasoning (“overthinking”) increases incoherence, and raising reasoning budgets does not offset the effect.
- Ensembling reliably reduces incoherence but may be impractical for agentic tasks with irreversible actions.
- LLMs behave as dynamical systems that must be trained to be optimizers; in a synthetic optimizer setting, scale reduces bias faster than variance, and incoherence grows with trajectory length.
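The bias-variance decomposition behind these points can be sketched numerically. The snippet below is a minimal illustration, not the paper's method: it treats repeated runs of a hypothetical stochastic model as samples around a target answer, and splits mean squared error into a systematic offset (bias, i.e. consistent misalignment) and spread across runs (variance, i.e. incoherence). The `easy`/`hard` distributions are invented assumptions standing in for task difficulty.

```python
import numpy as np

rng = np.random.default_rng(0)

def decompose_error(samples: np.ndarray, target: float) -> dict:
    """Split mean squared error into bias^2 (systematic offset of the
    average answer) and variance (incoherence across repeated runs)."""
    mean = samples.mean()
    bias_sq = (mean - target) ** 2
    variance = samples.var()
    return {"mse": bias_sq + variance, "bias_sq": bias_sq, "variance": variance}

# Hypothetical "model" outputs: same slight systematic offset in both
# regimes, but spread grows with task difficulty / trajectory length.
target = 10.0
easy = rng.normal(loc=10.5, scale=0.5, size=1000)  # small bias, small variance
hard = rng.normal(loc=10.5, scale=3.0, size=1000)  # same bias, large variance

for name, samples in [("easy", easy), ("hard", hard)]:
    d = decompose_error(samples, target)
    print(name, {k: round(v, 2) for k, v in d.items()})
```

On the hard task, variance dominates the error total even though the bias term is unchanged, which is the shape of failure the paper attributes to harder tasks and longer reasoning.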
Sentiment
The community is largely receptive and finds the research valuable, particularly practitioners building AI systems who see the findings validated in their own work. There is constructive criticism about the paper's scope and extrapolative claims, but the technical substance is generally respected. A minority dismisses the findings as obvious or questions whether they apply to future model architectures. The overall tone is engaged and substantive rather than dismissive.
In Agreement
- Multi-agent orchestration — using cheaper models for execution and smarter ones for planning — produces better coherence, validating the paper's finding that ensembling helps reduce variance
- Task decomposition fundamentally changes the bias/variance profile of each inference call, trading one high-variance call for multiple lower-variance ones that are more predictable overall
- Natural overthinking causing incoherence matches practitioner experience: models given complex prompts sometimes overthink themselves into unrequested “helpful” variations that break workflows
- The bias-variance framing maps well to practical AI workflows — systematic misalignment (bias) is relatively easy to fix with prompt engineering, while variance-dominated failures are much harder to address
- Keeping prompts small and tool counts low leads to more stable outputs; context pollution from reading files degrades token coherence
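Why ensembling reduces variance-dominated failure can be shown with a toy simulation. This is an illustrative sketch, not anything from the paper: `noisy_answer` is a hypothetical stochastic model call that is usually right but fails incoherently (random wrong options rather than a consistent wrong answer), and a plurality vote over independent samples averages that noise away.

```python
import random
from collections import Counter

random.seed(0)

def noisy_answer(correct: str = "A", p_correct: float = 0.7) -> str:
    """Hypothetical model call: returns the right answer with probability
    p_correct, otherwise a random wrong option (variance, not bias)."""
    if random.random() < p_correct:
        return correct
    return random.choice(["B", "C", "D"])

def majority_vote(k: int) -> str:
    """Ensemble k independent samples and return the plurality answer."""
    votes = Counter(noisy_answer() for _ in range(k))
    return votes.most_common(1)[0][0]

trials = 2000
single = sum(noisy_answer() == "A" for _ in range(trials)) / trials
ensemble = sum(majority_vote(7) == "A" for _ in range(trials)) / trials
print(f"single-call accuracy: {single:.2f}")
print(f"7-vote ensemble accuracy: {ensemble:.2f}")
```

Note this only works because the failures are incoherent: if the model were systematically biased toward one wrong answer, voting would entrench the error rather than cancel it, which is why the bias/variance distinction matters for choosing mitigations.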
Opposed
- The paper uses current AI to extrapolate future failure modes, which is limited — if future models solve the variance problem, the analysis becomes moot
- The tested models are already behind state-of-the-art, so the results may not hold for latest frontier models
- The finding about overthinking increasing incoherence is definitionally obvious for probabilistic systems and doesn't require a paper to demonstrate
- LLMs don't actually reason or do inference, so it's unsurprising they can't maintain systematic misalignment — incoherence is the expected default
- The paper reads as liability-shifting: framing AI failures as accidents rather than design flaws
- The study may only apply to autoregressive models, limiting its generalizability to future architectures