Gemini Omni: Conversational Video Creation and Multimodal Editing

Gemini Omni is a multimodal AI model that allows users to create and edit videos through natural language and diverse reference inputs. It features advanced capabilities like iterative editing, character swapping, and a deep understanding of real-world physics and history. Developed with safety in mind, it includes digital watermarking and is available across Google's creative platforms.

Key Points

Gemini Omni enables conversational, iterative video editing that maintains visual and narrative consistency across multiple turns.
The model combines multimodal inputs (text, images, video, and audio) to generate or transform cohesive video outputs.
It leverages deep world knowledge to accurately simulate physics, fluid dynamics, and historical or scientific contexts within creative scenes.
Advanced features include motion and style transfer, character swapping, and the ability to translate simple sketches into realistic footage.
Safety and transparency are prioritized through internal red-teaming and the use of SynthID watermarking to identify AI-generated content.

Sentiment

The overall sentiment is mixed and cautious. Hacker News (HN) commenters mostly accept that Gemini Omni is a meaningful technical advance, especially for editing and consistency, but they do not fully accept the stronger implication that it has solved realistic world modeling or safe, trustworthy video generation. The dominant tone is curious skepticism: impressed by the capability, quick to test its limits, and uneasy about the cultural and evidentiary consequences.

In Agreement

The model is technically impressive and appears to reduce the uncanny feeling of generated video, especially for casual viewing.
Iterative editing, scene consistency, and multimodal control may be more important than raw one-shot generation.
World knowledge and spatial awareness could unlock useful workflows such as location-aware simulation, storyboarding, and previsualization (previs).
Future models may learn better dynamics with richer spatial representations, tracking pipelines, simulators, or differentiable physics rather than needing an entirely different paradigm.
AI-assisted filmmaking could be valuable when directed by people with strong creative intent, and some viewers would be open to watching such work.

Opposed

The generated examples still fail at realistic physical causality, with objects morphing, disappearing, or moving like visual impressions instead of simulated bodies.
Some commenters argue the outputs are technically impressive but artistically weak, sterile, or likely to increase low-quality media.
Several participants question whether video generation is the right strategic focus for Google compared with other AI applications.
Synthetic video raises serious authenticity and evidence problems, and watermarking or provenance systems may not be enough to restore trust.
Creative tools do not automatically create good work; taste, imagination, and direction remain limiting factors for many users.