From Word Models to World Models: Training AI for Adversarial Robustness

Added Feb 9
Article: Neutral · Community: Positive · Divisive

Experts outperform LLMs because they model how other agents with hidden incentives will adapt to their moves, not just how to craft plausible text. Chess-like, perfect-information tasks favor LLMs, but real-world work is often poker-like, requiring unreadable, balanced strategies and theory-of-mind. The path forward is training on next-state outcomes in multi-agent environments to reduce exploitability and build adversarial robustness.

Key Points

  • Expert performance hinges on deep multi-agent world models that simulate hidden incentives, adaptations, and theory-of-mind—far beyond producing coherent text.
  • Perfect-information tasks (chess-like) reward calculation over modeling minds; imperfect-information tasks (poker-like) demand unreadable, balanced strategies and recursive reasoning.
  • LLMs are trained to generate artifacts that score well in isolation (RLHF), making them predictable and exploitable in adversarial settings where others adapt.
  • Scaling intelligence or prompting for ‘strategy’ doesn’t fix the missing training loop; the crucial causal knowledge lives in outcomes and adaptations that are not fully captured in text corpora.
  • Fix: evaluate and train LLMs as agents in multi-agent, hidden-state environments, optimizing for next-state outcomes and adversarial robustness rather than one-shot plausibility.
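The exploitability notion behind these points can be made concrete with a toy example (not from the article): in matching pennies, a player whose strategy leans predictably toward one action hands an adapting adversary a guaranteed edge, while the balanced mixed strategy concedes nothing. The function name and payoff convention below are illustrative assumptions.

```python
# Minimal sketch (illustrative, not from the article): exploitability of a
# fixed strategy in matching pennies. Payoffs are +1/-1; the adversary
# (the "matcher") wins when both players show the same side.

def exploitability(p_heads: float) -> float:
    """Best-response payoff an adversary can extract per round against a
    strategy that plays Heads with probability p_heads. Against a fixed
    mix, the adversary's best response is a pure action, so we just take
    the better of the two."""
    ev_match_heads = p_heads * 1 + (1 - p_heads) * (-1)  # adversary always shows Heads
    ev_match_tails = p_heads * (-1) + (1 - p_heads) * 1  # adversary always shows Tails
    return max(ev_match_heads, ev_match_tails)

print(exploitability(0.8))  # predictable lean: adversary gains 0.6 per round
print(exploitability(0.5))  # balanced mix: adversary gains nothing
```

A one-shot plausibility objective never penalizes the 0.8 player, because its individual moves look fine in isolation; only an adversary that adapts over repeated play exposes the leak, which is the training signal the article argues is missing.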

Sentiment

The community is broadly sympathetic to the article's thesis that LLMs have fundamental limitations in adversarial and multi-agent settings, but with substantial nuance. Most commenters agree with the general framework while pushing back on its absolutism — arguing the limitations are real but more addressable than the article suggests, or that the framing needs refinement (cognition vs. authority rather than word vs. world). The tone is intellectual and constructive, with the article's author participating and engaging substantively with critiques.

In Agreement

  • LLMs are fundamentally constrained by their linguistic training data and model 'talking about reality' rather than reality itself, since tokenized text captures only a fraction of human world models
  • The chess-vs-poker distinction is real and important: LLMs fail in domains requiring theory-of-mind, hidden information, and adversarial reasoning that aren't well-represented in training text
  • LLMs are dangerously unreliable in professional domains like law because they cannot distinguish expert from non-expert sources, treating all training data with equal credulity
  • The cooperative, agreeable bias from RLHF training makes LLMs predictable and exploitable in adversarial settings — they lack training that rewards survival under strategic adaptation
  • Training data is the fundamental bottleneck: reaching the next level of capability requires corresponding data that demonstrates adversarial reasoning abilities, which largely doesn't exist in text form

Opposed

  • Many of the article's failure examples are artifacts of naive prompting — when asked to review or reflect, LLMs produce significantly more nuanced and context-aware responses
  • The word-model vs world-model framing is a false binary: LLMs encode stable relational structure including entities, roles, causality, and counterfactuals that constitute genuine (if partial) world understanding
  • The chess-vs-poker distinction isn't as clean as claimed — AlphaZero research shows chess also involves meaningful hidden state in strategic concepts, and poker AI (Pluribus) already exists
  • Multimodality and reinforcement learning fine-tuning are already addressing these limitations, making the critique a snapshot of a rapidly moving target
  • Both sides of the debate miss the real point: the issue isn't whether LLMs understand the world but under what conditions their outputs should be treated as authoritative — separating cognition from decision dissolves the disagreement