Nano Banana: Google’s AR Image Model That Actually Follows Your Prompts

Google’s “Nano Banana” (Gemini 2.5 Flash Image) is an autoregressive image model that follows complex, structured prompts remarkably well and performs precise multi-edit tasks, often beating ChatGPT’s image model in fidelity and composition. Tests spanning intricate object edits, subject conditioning, HTML rendering, and JSON-driven character creation show strong adherence, which the article attributes to a Gemini-trained text encoder and a large context window. The main drawbacks are weak style transfer and lenient moderation/IP controls, alongside the usual quirks of text-in-image rendering.
Key Points
- Nano Banana (Gemini 2.5 Flash Image) excels at prompt adherence and localized edits, handling complex, multi-constraint instructions far better than many diffusion models and ChatGPT’s gpt-image-1.
- Its autoregressive architecture, large context window (32k tokens), and Gemini-trained text encoder (well versed in Markdown and JSON) enable precise control, subject consistency, and even coherent in-image text and code.
- Empirical tests include multi-edit object manipulation, subject conditioning with multiple references (Ugly Sonic + Obama), structured HTML/JSON prompting, and a stringent multi-rule kitten composition—all largely successful.
- Cost and access: images are free but watermarked via Gemini/AI Studio. API use (~$0.04 per 1-megapixel image) provides unwatermarked outputs and is cheaper than gpt-image-1 (~$0.17). The author also released `gemimg`, a Python wrapper.
- Notable weaknesses: poor style transfer on user photos, and comparatively lenient IP/NSFW moderation that may allow brand/IP-heavy or adult content, which poses legal and safety concerns.
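
To put the quoted prices in perspective, here is a minimal sketch of the arithmetic. It assumes both figures are flat per-image prices for comparable output sizes and ignores any input-token charges; the constant and function names are illustrative and not part of any API.

```python
# Approximate prices quoted above, in USD per generated image.
# Assumption: both are flat per-image rates for images of similar size.
NANO_BANANA_USD = 0.04
GPT_IMAGE_1_USD = 0.17


def batch_cost(price_per_image: float, n_images: int) -> float:
    """Estimated cost in USD of generating n_images, rounded to cents."""
    if n_images < 0:
        raise ValueError("n_images must be non-negative")
    return round(price_per_image * n_images, 2)


def price_ratio(expensive: float, cheap: float) -> float:
    """How many times more the expensive option costs per image."""
    return expensive / cheap


if __name__ == "__main__":
    n = 1000
    print(f"Nano Banana, {n} images: ${batch_cost(NANO_BANANA_USD, n):.2f}")
    print(f"gpt-image-1, {n} images: ${batch_cost(GPT_IMAGE_1_USD, n):.2f}")
    print(f"Per-image ratio: {price_ratio(GPT_IMAGE_1_USD, NANO_BANANA_USD):.2f}x")
```

Under these assumptions, a 1,000-image batch comes to roughly $40 versus $170, a little over four times cheaper.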
Sentiment
Overall, Hacker News sentiment is largely positive: commenters are intrigued by Nano Banana's capabilities, especially its superior prompt adherence and cost-effectiveness. That enthusiasm is tempered by specific critiques of its consistency in complex edits, disagreement over its style transfer capabilities (which some found better than reported), and debate over the deeper implications of 'prompt engineering.' The core strengths are validated, but practical limitations and conceptual debates invite closer scrutiny.
In Agreement
- The `gemimg` Python library is a valuable and well-received tool that accompanies the model.
- Nano Banana is uniquely strong at following specific, granular instructions, a capability not all models possess.
- The model's ability to render HTML code into images is impressive and accurately reflects its strong prompt adherence.
- Prompt engineering is a necessary and legitimate skill, akin to defining a specification, especially for complex or iterative tasks where model ambiguities need careful handling (e.g., specifying 'distinct characters' and 'left to right' for multi-subject scenes).
- Nano Banana outperforms other models (such as gpt-image-1) at masked image changes and at preserving detail (texture, lighting, sharpness), which commenters attribute to its low spatial scaling and possible use of segmentation masks. This challenges the notion that all generative edits regenerate the entire image.
- The model offers a highly amenable interface for specific instructions, a key advantage highlighted in the article.
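
The "prompt as specification" point above, combined with the JSON-driven prompting the article tested, might look like the following sketch. The field names (`subjects`, `position`, `constraints`), the instruction text, and the example characters are hypothetical choices for illustration; the article's actual prompts are not reproduced here, and the resulting string would be passed to whatever client you use.

```python
import json

# Hypothetical one-line instruction placed ahead of the JSON spec.
INSTRUCTION = "Generate an image that satisfies every field of this JSON specification:"


def build_prompt(spec: dict) -> str:
    """Serialize a scene spec to JSON and prepend a one-line instruction."""
    return INSTRUCTION + "\n\n" + json.dumps(spec, indent=2)


# Hypothetical spec: explicit ordering and distinctness remove ambiguity
# that a free-form sentence might leave open.
scene = {
    "subjects": [
        {"position": "left", "description": "a knight in silver armor"},
        {"position": "center", "description": "a wizard in a blue robe"},
        {"position": "right", "description": "an archer in a green cloak"},
    ],
    "constraints": [
        "each subject is a distinct character",
        "subjects appear left to right in the listed order",
    ],
}

if __name__ == "__main__":
    print(build_prompt(scene))
```

Keeping the spec as data means each constraint can be added, removed, or reordered between iterations without rewriting the surrounding prose.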
Opposed
- The article's assertion that Nano Banana is "terrible at style transfer" is contested, with users reporting success in applying new styles while maintaining scene geometry (e.g., transforming modern scenes to 18th-century styles).
- Nano Banana can exhibit specific spatial or directional errors in image generation, such as misplacing objects like fruits on the skull pancake relative to the prompt description, or generating details like upward-pointing blueberries.
- Despite its adherence strengths, the model can still make 'massive, seemingly random edits,' adjust image scale, and introduce pervasive detail changes (e.g., adding a fireplace) even with explicit negative instructions and low temperature settings, hindering reliable application building.
- Some argue that AI models fundamentally regenerate all details during edits, challenging the idea of 'only necessary aspects changed,' and therefore may not be suitable replacements for professional photo editing due to potential loss of nuance.
- The term 'prompt engineering' is seen by some as unnecessary jargon or a way to avoid developing actual design or coding skills, dismissing its validity as a distinct discipline.
- The article is critiqued for not delving deeply enough into the philosophical nuances of multimodal models, such as whether 'prompt rewriting' constitutes 'thinking' in the context of models like Imagen 4 Ultra versus Nano Banana.