Veo 3: Emergent Zero‑Shot Video Intelligence Toward Vision Foundation Models

Added Sep 25, 2025
Article: Positive
Community: Positive/Mixed

Veo 3 demonstrates wide-ranging zero-shot performance across perception, physical modeling, visual manipulation, and reasoning tasks. These capabilities arise without task-specific training, echoing the emergence seen in LLMs trained on web-scale data. The authors conclude that video models are poised to become generalist vision foundation models.

Key Points

  • Generative video models trained at scale exhibit emergent zero-shot capabilities across a wide spectrum of visual tasks.
  • Veo 3 performs perception tasks (from edge detection to interpreting classic visual illusions) without task-specific training.
  • It demonstrates physical modeling and world understanding (material properties, dynamics, optics, color mixing, memory).
  • It can manipulate and create visuals (editing, composition, 3D reposing, novel views) and perform dexterous, affordance-aware actions.
  • Early forms of visual reasoning (graph/maze solving, sequences, analogies, rule extrapolation) emerge, suggesting a path to vision foundation models.
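The evaluation pattern these points imply — prompt a video model with an input image plus a text instruction, then score the final generated frame against ground truth — can be sketched as a toy harness. This is a minimal illustration, not the paper's actual protocol: the model call below is a stand-in stub (no real Veo 3 API is used), and `evaluate_zero_shot`, the binary-mask frame format, and the IoU metric are all assumptions chosen for the sketch.

```python
from typing import Callable, List

Frame = List[List[int]]  # a binary mask frame, 1 = foreground pixel

def iou(a: Frame, b: Frame) -> float:
    """Intersection-over-union between two binary masks of equal shape."""
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x | y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 1.0

def evaluate_zero_shot(model: Callable[[str, Frame], List[Frame]],
                       prompt: str, image: Frame, target: Frame) -> float:
    """Prompt the model with an instruction and an input frame,
    then score only the last frame of the generated video."""
    video = model(prompt, image)
    return iou(video[-1], target)

# Stub standing in for a generative video model: it just echoes the
# input as a one-frame "video", so the score below is trivially 1.0.
def stub_model(prompt: str, image: Frame) -> List[Frame]:
    return [image]

img = [[1, 0], [0, 1]]
print(evaluate_zero_shot(stub_model, "segment the object", img, img))  # → 1.0
```

The point of the sketch is that the harness is task-agnostic: swapping the prompt (edge detection, maze solving, segmentation) changes the task without any task-specific training, which is the zero-shot claim being tested.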

Sentiment

The community is predominantly positive, impressed by the breadth and systematic coverage of Veo 3's demonstrated zero-shot capabilities. Most commenters see these results as compelling evidence for a trajectory toward unified vision foundation models. The main dissenting voice argues from a philosophy-of-mind perspective that these benchmarks are fundamentally misguided, but this position found little support and the associated thread was partially flagged by the community.

In Agreement

  • The systematic study of video models' emergent zero-shot capabilities is a valuable contribution, building on ideas previously circulating informally.
  • Video-trained models build superior spatial understanding that transfers even to still image generation, outperforming dedicated image models on tasks like human anatomy.
  • AI training can be reframed as teaching models to extrapolate future states from a starting condition plus intent, which naturally yields broad problem-solving ability.
  • These findings suggest convergence toward a single unified model architecture capable of general-purpose vision tasks.

Opposed

  • The benchmark categories used (segmentation, affordance recognition, etc.) are holdovers from outdated cognitive science frameworks and do not reflect how biological intelligence actually perceives.
  • ML ignores integrative, ecological, and affective approaches to perception that are more likely routes to genuine intelligence and consciousness.
  • These models are cheap workarounds that exclude the senses and represent homogenization posing as intelligence rather than true paths to vision understanding.