Single‑Pass Image Editing Showdown: Style Wins, Precision Still Hard

A head-to-head benchmark of six image-editing models across 12 single-pass, text-only tasks finds Seedream 4 best overall (9/12), with Gemini 2.5 Flash second (7/12). Models succeed more often at stylized insertions and global changes than at precise spatial swaps or selective removals. The most common failure modes are over-editing (especially by gpt-image-1), violating constraints, and failing to preserve texture.
Key Points
- Seedream 4 tops the benchmark (9/12), with Gemini 2.5 Flash second (7/12); OmniGen2 lags (1/12).
- Precise, localized, and spatially constrained edits (block position swaps, M&M filtering, Pisa straightening, neck shortening) remain the hardest.
- Models generally handle stylized insertions and global aesthetic changes better (e.g., Great Wave surfer, PAWS poster, Girl with a Pearl Earring lighting).
- OpenAI gpt-image-1 often alters entire scenes, hurting constraint adherence; Qwen-Image-Edit is strong for a locally hostable model but can over-edit.
- Preserving texture and style under targeted replacement (e.g., the weathered sign, changing a card suit without touching the Ace) is a frequent failure point.
Sentiment
Hacker News sentiment is largely constructive but mixed. Commenters appreciate the benchmark's utility and the realism of the prompts, yet there is significant critical debate over the specific model evaluations, the nuances of individual model performance (especially Gemini's strengths and weaknesses), and the prompt-design methodology.
In Agreement
- AI-generated elements still often look "off" or "brushed on" when edited into real-world photographs, as seen in examples like George's hair or the added tree.
- The shift of AI image generation toward online-hosted models is real, largely because model size and compute requirements now exceed what typical hobbyist self-hosted setups can handle.
- Flux models show surprising quality and potential, despite not being as widely adopted as Gemini or ChatGPT.
- The prompts used in the article were realistic and representative of how a non-expert user might interact with these tools.
- The benchmark offers valuable insights into real-world model performance, going beyond simple charts.
Opposed
- The article's judging criteria for specific tasks were too lenient or flawed, particularly for the "add hair" task, where only Gemini 2.5 Flash was seen as truly passing without issues like altered color grading or an unnatural appearance.
- The article's prompting methodology is questionable; providing models with context already evident in the image (e.g., "the tower in the image is leaning to the right") can confuse or bias the model into exaggerating that feature.
- Gemini 2.5 Flash, despite its strengths, can be inconsistent, sometimes producing completely unexpected results and failing repeatedly on certain tasks.
- Gemini (Nano Banana) struggles significantly with certain editing tasks, such as adding or removing exterior architectural elements like curbs or matching specific colors, weaknesses the benchmark may not fully surface.