Veo 3: Emergent Zero‑Shot Video Intelligence Toward Vision Foundation Models

Added Sep 25, 2025
Article: Positive
Community: Positive/Mixed

Veo 3 demonstrates wide-ranging zero-shot performance across perception, physical modeling, visual manipulation, and reasoning tasks. These capabilities arise without task-specific training, echoing the emergence seen in LLMs trained on web-scale data. The authors conclude that video models are poised to become generalist vision foundation models.

Key Points

  • Generative video models trained at scale exhibit emergent zero-shot capabilities across a wide spectrum of visual tasks.
  • Veo 3 performs perception tasks (from edge detection to interpreting classic visual illusions) without task-specific training.
  • It demonstrates physical modeling and world understanding (material properties, dynamics, optics, color mixing, memory).
  • It can manipulate and create visuals (editing, composition, 3D reposing, novel views) and perform dexterous, affordance-aware actions.
  • Early forms of visual reasoning (graph/maze solving, sequences, analogies, rule extrapolation) emerge, suggesting a path to vision foundation models.
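The evaluation pattern these points imply — prompt a video model with an input image plus a text instruction, then score the final generated frame against ground truth — can be sketched as a toy harness. This is a minimal illustration, not the paper's actual protocol: the model call below is a stand-in stub (no real Veo 3 API is used), and `evaluate_zero_shot`, the binary-mask frame format, and the IoU metric are all assumptions chosen for the sketch.

```python
from typing import Callable, List

Frame = List[List[int]]  # a binary mask frame, 1 = foreground pixel

def iou(a: Frame, b: Frame) -> float:
    """Intersection-over-union between two binary masks of equal shape."""
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x | y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 1.0

def evaluate_zero_shot(model: Callable[[str, Frame], List[Frame]],
                       prompt: str, image: Frame, target: Frame) -> float:
    """Prompt the model with an instruction and an input frame,
    then score only the last frame of the generated video."""
    video = model(prompt, image)
    return iou(video[-1], target)

# Stub standing in for a generative video model: it just echoes the
# input as a one-frame "video", so the score below is trivially 1.0.
def stub_model(prompt: str, image: Frame) -> List[Frame]:
    return [image]

img = [[1, 0], [0, 1]]
print(evaluate_zero_shot(stub_model, "segment the object", img, img))  # → 1.0
```

The point of the sketch is that the harness is task-agnostic: swapping the prompt (edge detection, maze solving, segmentation) changes the task without any task-specific training, which is the zero-shot claim being tested.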

Sentiment

The community is predominantly positive, impressed by the breadth and systematic coverage of Veo 3's demonstrated zero-shot capabilities. Most commenters see these results as compelling evidence for a trajectory toward unified vision foundation models. The main dissenting voice argues from a philosophy-of-mind perspective that these benchmarks are fundamentally misguided, but this position found little support and the associated thread was partially flagged by the community.

In Agreement

  • The systematic study of video models' emergent zero-shot capabilities is a valuable contribution, building on ideas previously circulating informally.
  • Video-trained models build superior spatial understanding that transfers even to still image generation, outperforming dedicated image models on tasks like human anatomy.
  • AI training can be reframed as teaching models to extrapolate future states from a starting condition plus intent, which naturally yields broad problem-solving ability.
  • These findings suggest convergence toward a single unified model architecture capable of general-purpose vision tasks.

Opposed

  • The benchmark categories used (segmentation, affordance recognition, etc.) are holdovers from outdated cognitive science frameworks and do not reflect how biological intelligence actually perceives.
  • ML ignores integrative, ecological, and affective approaches to perception that are more likely routes to genuine intelligence and consciousness.
  • These models are cheap workarounds that exclude the senses and represent homogenization posing as intelligence rather than true paths to vision understanding.