Gemini 2.5 Computer Use: High‑performance, safe UI control via API

Google DeepMind launched the Gemini 2.5 Computer Use model via the Gemini API to let agents interact with UIs by iteratively analyzing screenshots and issuing structured actions. It leads on major web and mobile control benchmarks while offering lower latency, and it ships with built-in safety measures, including per-step action review and confirmation controls. The model is available in public preview, with demos, docs, and reference implementations in Google AI Studio and Vertex AI.
Key Points
- New specialized Computer Use model (built on Gemini 2.5 Pro) lets agents operate web and mobile UIs via a computer_use tool that runs in an observe–act loop using screenshots and action history.
- Leads on key web/mobile control benchmarks (Online-Mind2Web, WebVoyager, AndroidWorld) and achieves lower latency per Browserbase evaluations.
- Safety-by-design plus developer guardrails: per-step safety review of proposed actions and system instructions to refuse or require confirmation for high-risk operations.
- Early adopters report strong results in speed, reliability, and UI test resilience; the model underpins projects like Project Mariner, Firebase Testing Agent, and certain AI Mode capabilities in Search.
- Available now in public preview through the Gemini API on Google AI Studio and Vertex AI, with demos, docs, and reference code to get started.
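The observe–act loop and per-step safety guardrails described above can be sketched roughly as follows. This is an illustrative skeleton only, assuming hypothetical `model`, `env`, and `ProposedAction` interfaces; it does not reflect the real Gemini API surface or the actual `computer_use` tool schema.

```python
from dataclasses import dataclass

# Hypothetical action proposed by the model at each step: an action name,
# its arguments, and whether the model flagged it as needing confirmation.
@dataclass
class ProposedAction:
    name: str                           # e.g. "click_at", "type_text"
    args: dict
    requires_confirmation: bool = False

# Illustrative developer-defined set of high-risk operations.
HIGH_RISK_ACTIONS = {"submit_payment", "delete_account"}

def agent_loop(model, env, max_steps=10, confirm=lambda action: False):
    """Run an observe-act loop: screenshot -> model -> action -> repeat.

    `model(screenshot, history)` returns a ProposedAction, or None when the
    task is complete; `env.screenshot()` captures the UI and `env.execute()`
    performs an action. All of these interfaces are stand-ins.
    """
    history = []
    for _ in range(max_steps):
        shot = env.screenshot()                 # observe the current UI
        action = model(shot, history)           # model proposes next action
        if action is None:                      # model signals completion
            break
        # Per-step guardrail: flagged or high-risk actions require an
        # explicit confirmation callback before they are executed.
        if action.requires_confirmation or action.name in HIGH_RISK_ACTIONS:
            if not confirm(action):
                history.append((action, "refused"))
                continue
        env.execute(action)                     # act on the environment
        history.append((action, "executed"))
    return history
```

The key design point mirrored here is that safety review happens per step, on the proposed action, before execution, so a developer-supplied `confirm` hook can veto risky operations without aborting the whole session.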
Sentiment
The overall sentiment of the Hacker News discussion is mixed, leaning toward cautious, pragmatic optimism but with significant criticism. Many users acknowledge the practical utility and innovation of controlling UIs via screenshots, especially for messy real-world applications that lack APIs. However, there are notable concerns about the model's current speed, accuracy, and limited ability to interpret complex visual feedback, as well as a more fundamental debate over whether visual input is the most efficient interface for AI.
In Agreement
- The approach of using computer vision on screenshots for UI interaction, despite seeming complex, is a practical and effective solution given the current state of technology and the messy, unstructured nature of many real-world UIs.
- It provides a way to automate tasks for services that lack structured APIs, enabling interaction with a broader range of applications and bypassing the 'adversarial' nature of the internet where monetization prevents easy programmatic access.
- This technology has the potential to significantly reduce 'soul-crushing drudgery' for office workers dealing with poorly designed, repetitive UIs (e.g., in enterprise software like SAP).
- The model shows promise for general application control and accessibility, even for complex tasks like booking flights or managing apps via voice commands; speed may matter less when tasks can run in the background.
- There's abundant pretraining data available for screen recordings and mouse movements, which supports the visual-based approach over less common accessibility tree data.
Opposed
- The model exhibits limitations in interpreting complex visual feedback (e.g., Wordle colors) and struggles with accuracy in UI interaction, often requiring multiple attempts or making precision errors (e.g., misclicking in the HN demo, errors in Google Sheets).
- It is perceived as too slow for certain applications like E2E testing, with the iterative action-feedback loop introducing significant latency compared to traditional automation tools like Playwright.
- Critics argue that using screenshots is an inefficient and less robust interface compared to leveraging structured data from accessibility trees or DOM, which provide richer context and can be more precise.
- Concerns were raised about the model's ability to handle dynamic or hidden UI elements without explicit (and potentially inefficient) exploration, and about its general reliability in practical, high-stakes scenarios.
- Deployment in critical enterprise systems will require additional governance, trust, and control mechanisms (like hooks/callbacks) to ensure deterministic guarantees and manage risks, which are currently lacking.
- Skepticism exists regarding the efficiency of training AI on human-designed, often inefficient, UIs, comparing it to building a 'mechanical horse' rather than designing for optimal AI interaction.