Gemini 2.5 Computer Use: High‑performance, safe UI control via API

Added Oct 7, 2025
Article sentiment: Positive · Community sentiment: Divisive

Google DeepMind launched the Gemini 2.5 Computer Use model via the Gemini API to let agents interact with UIs by iteratively analyzing screenshots and issuing structured actions. It leads major web and mobile control benchmarks while offering lower latency and built-in safety measures, plus per-step action review and confirmation controls. The model is available in public preview with demos, docs, and reference implementations in Google AI Studio and Vertex AI.

Key Points

  • New specialized Computer Use model (built on Gemini 2.5 Pro) lets agents operate web and mobile UIs via a computer_use tool that runs in an observe–act loop using screenshots and action history.
  • Leads on key web/mobile control benchmarks (Online-Mind2Web, WebVoyager, AndroidWorld) and achieves lower latency per Browserbase evaluations.
  • Safety-by-design plus developer guardrails: per-step safety review of proposed actions and system instructions to refuse or require confirmation for high-risk operations.
  • Early adopters report strong results in speed, reliability, and UI test resilience; the model underpins projects like Project Mariner, Firebase Testing Agent, and some AI Mode capabilities in Search.
  • Available now in public preview through the Gemini API on Google AI Studio and Vertex AI, with demos, docs, and reference code to get started.
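The observe–act loop described in the first bullet can be sketched roughly as follows. This is an illustrative skeleton only: `take_screenshot`, `execute_action`, and the `model` callable are hypothetical stand-ins for the real Gemini API call and client-side executor, stubbed here so the control flow runs end to end.

```python
# Illustrative skeleton of the screenshot-based observe–act loop.
# All helper names are hypothetical, not the real Gemini SDK surface.
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str                          # e.g. "click_at", "type_text", "done"
    args: dict = field(default_factory=dict)

def take_screenshot(env):
    # Stub: a real agent would capture the browser/device screen as an image.
    return env["screen"]

def execute_action(env, action):
    # Stub: a real agent would dispatch the structured action to the UI.
    env["log"].append(action.kind)
    env["screen"] = f"after:{action.kind}"

def run_agent(env, goal, model, max_steps=10):
    """Observe–act loop: screenshot -> model proposes an action -> execute it,
    feeding the accumulated action history back to the model each turn."""
    history = []
    for _ in range(max_steps):
        observation = take_screenshot(env)
        action = model(observation, goal, history)
        if action.kind == "done":      # model signals task completion
            break
        execute_action(env, action)
        history.append(action)
    return history
```

The per-step safety review and confirmation hook mentioned above would sit between the model's proposed action and `execute_action`, refusing or pausing on high-risk operations before they touch the UI.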

Sentiment

The prevailing mood is one of measured skepticism. Most commenters who tried the demo were disappointed by its current reliability and speed. There is broad agreement that the technology has strong potential for specific async automation tasks, but considerable doubt about whether the screenshot-based approach is the right long-term architecture. The discussion is notably constructive, with the Google team actively engaging and several users sharing real-world experiences. The sentiment tilts positive toward the concept but negative toward the current execution quality.

In Agreement

  • Screenshot-based computer use is a pragmatic approach because most apps lack APIs and accessibility metadata is unreliable; the visual interface is the only one that was actually designed and tested
  • Background and async automation for tedious, repetitive enterprise tasks like expense reporting, form filling, and cross-system data entry is a compelling use case
  • The best near-term pattern is using LLMs to explore and learn a UI, then generate deterministic scripts for repeated execution rather than keeping the LLM in the loop every time
  • Computer use benchmarks are among the most important AI benchmarks to watch for forecasting labor-market impact
  • The technology raises genuine concerns about CAPTCHA circumvention, fraud, and the erosion of bot-detection mechanisms
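The "explore, then compile to a script" pattern from the third bullet above could look roughly like this: record the actions an LLM-driven run executes, then replay the saved script deterministically with no model in the loop. All names here are illustrative assumptions, not from any real tool.

```python
# Illustrative sketch of "LLM explores once, deterministic script replays".
# record_run and replay are hypothetical helpers, not a real API.
import json

def record_run(actions_from_llm, execute):
    """Exploratory pass: execute each model-proposed action and log it."""
    script = []
    for action in actions_from_llm:
        execute(action)
        script.append(action)
    return json.dumps(script)          # durable, reviewable artifact

def replay(script_json, execute):
    """Deterministic pass: re-run the recorded actions verbatim,
    paying no model-inference cost on repeated executions."""
    for action in json.loads(script_json):
        execute(action)
```

The serialized script doubles as an audit trail: it can be diffed, reviewed, and re-run against a staging UI before being trusted in production.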

Opposed

  • Screenshots are an inefficient and fundamentally wrong abstraction layer; AI should consume structured data like DOM trees, accessibility trees, or APIs rather than visually decode rendered images
  • The model is underwhelming in practice, struggling with basic tasks, prematurely stopping, and failing at precise interactions despite impressive benchmark numbers
  • Google's restraint on CAPTCHA solving is a competitive disadvantage since competitors will solve CAPTCHAs anyway
  • Computer use models are overhyped and reminiscent of late-stage cryptocurrency enthusiasm before a crash
  • LLMs are fundamentally too nondeterministic for any real enterprise governance guarantees, making agent safety claims dubious