Vision Banana: Google DeepMind’s Instruction-Tuned Model Outperforms SAM 3 and Depth Anything V3

Image Generators are Generalist Vision Learners

Google DeepMind has unveiled Vision Banana, a unified model that treats perception as an image generation task. The system achieves 0.699 mIoU on Cityscapes, outperforming the specialized SAM 3 model. The result suggests that generative pretraining already encodes the geometric and semantic understanding required for advanced visual recognition.

Why This Matters

Historically, the computer vision community has separated generative and discriminative models, assuming that the weights required for photorealistic synthesis were fundamentally different from those needed for semantic extraction. This split led to a reliance on specialized architectures and task-specific decoder heads, which increased engineering overhead and limited the generalization of foundation models. Vision Banana suggests that image generation pretraining can serve as a general-purpose foundation for vision, much as language-model pretraining gives rise to language understanding in LLMs. By parameterizing vision tasks as RGB images, so that a segmentation mask or depth map is itself produced as a picture, the model achieves state-of-the-art performance across segmentation and depth estimation benchmarks without specialized modules or real-world depth training data.
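To make "parameterizing vision tasks as RGB images" concrete, here is a minimal Python sketch of how a generated segmentation image could be decoded back into class labels and scored with mIoU. The palette, the nearest-color decoding rule, and the function names are illustrative assumptions; the article does not describe Vision Banana's actual color scheme or decoding step.

```python
import numpy as np

# Hypothetical palette: one RGB color per class (assumed, not from the article).
PALETTE = {
    "road": (128, 64, 128),
    "car": (0, 0, 142),
    "person": (220, 20, 60),
}


def decode_mask(rgb, palette):
    """Assign each pixel of an (H, W, 3) image to the nearest palette color.

    Returns an (H, W) array of class indices in palette insertion order.
    """
    colors = np.array(list(palette.values()), dtype=np.float64)   # (K, 3)
    pixels = np.asarray(rgb, dtype=np.float64).reshape(-1, 3)     # (N, 3)
    # Euclidean distance from every pixel to every palette color.
    dists = np.linalg.norm(pixels[:, None, :] - colors[None, :, :], axis=-1)
    return dists.argmin(axis=1).reshape(np.asarray(rgb).shape[:2])


def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))


if __name__ == "__main__":
    # A 2x2 "generated" image whose colors are slightly off the palette.
    generated = np.array(
        [[[126, 66, 130], [3, 2, 140]],
         [[218, 25, 58], [222, 18, 61]]],
        dtype=np.uint8,
    )
    pred = decode_mask(generated, PALETTE)
    gt = np.array([[0, 1], [2, 1]])
    print(pred)
    print(mean_iou(pred, gt, num_classes=len(PALETTE)))
```

Nearest-color matching is used here because a generated image will rarely hit palette colors exactly; any tolerance or post-processing the real system applies is not covered by this sketch.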

Key Insights

Vision Banana (2026) achieves a 0.929 δ1 score (the fraction of pixels whose predicted depth is within a factor of 1.25 of the true depth) on metric depth benchmarks, surpassing Depth Anything V3’s 0.918 while using only synthetic training data.
The model utilizes a strictly invertible power transform (λ = -3) to map unbounded metric depth values into the bounded RGB color space.
In semantic segmentation, Vision Banana reaches 0.699 mIoU on Cityscapes, representing a 4.7-point gain over SAM 3 in zero-shot transfer settings.
Reasoning segmentation capabilities are demonstrated by a 0.793 gIoU on ReasonSeg, outperforming in-domain trained models like X-SAM.
The instruction-tuning process preserves generative quality, maintaining a 53.5% win rate against the Nano Banana Pro base model on text-to-image benchmarks.
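The depth items in the list above rest on two mechanics: an invertible transform that squeezes unbounded depth into a bounded range, and the δ1 accuracy metric. The Python sketch below illustrates both. The article gives only the exponent λ = -3, so the specific form u = 1 - (1 + d)^λ is an assumption chosen because it is bounded and invertible; the model's actual transform, any depth scaling, and the way the bounded value is turned into RGB colors may differ. The δ1 function follows the standard threshold-1.25 definition.

```python
import numpy as np

LAMBDA = -3.0  # exponent quoted in the article


def depth_to_unit(depth_m, lam=LAMBDA):
    """Map depth in [0, inf) to [0, 1) using the ASSUMED form 1 - (1 + d)**lam."""
    d = np.asarray(depth_m, dtype=np.float64)
    return 1.0 - np.power(1.0 + d, lam)


def unit_to_depth(u, lam=LAMBDA):
    """Exact inverse of depth_to_unit for u in [0, 1)."""
    u = np.asarray(u, dtype=np.float64)
    return np.power(1.0 - u, 1.0 / lam) - 1.0


def delta1(pred, gt, thresh=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < thresh.

    Pixels with gt <= 0 are ignored; pred is assumed strictly positive.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float(np.mean(ratio < thresh))


if __name__ == "__main__":
    depths = np.array([0.0, 0.5, 2.0, 10.0, 100.0])
    encoded = depth_to_unit(depths)
    print(encoded)                 # all values lie in [0, 1)
    print(unit_to_depth(encoded))  # recovers the input depths

    gt = np.array([1.0, 2.0, 4.0, 10.0])
    pred = np.array([1.1, 2.6, 4.0, 9.0])
    print(delta1(pred, gt))        # 3 of 4 pixels are within the 1.25 ratio
```

With a negative exponent, the transform spends most of its output range on nearby depths and compresses distant ones, which is one plausible reason to prefer a power transform over a linear rescale.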

Practical Applications

Use case: Absolute metric depth estimation for autonomous vehicles using purely visual cues without camera parameters. Pitfall: Hardcoding camera intrinsics into the inference pipeline, which restricts the model’s native ability to infer scale from world knowledge.
Use case: Reasoning segmentation for security systems to identify objects based on complex natural language descriptions. Pitfall: Training on narrow in-domain datasets, which leads to failure when the model encounters novel reasoning scenarios in the wild.
Use case: Precise surface normal estimation for 3D reconstruction in industrial manufacturing using RGB mappings. Pitfall: Assuming standard regression heads are superior to generative outputs, which ignores the rich geometric representations learned during image synthesis pretraining.
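For the surface normal use case, the sketch below shows the widely used normal-map convention of mapping each unit-vector component from [-1, 1] to an 8-bit color channel and back. The article does not state which RGB mapping Vision Banana uses, so this convention and the function names are assumptions for illustration only.

```python
import numpy as np


def normals_to_rgb(normals):
    """Encode unit normals of shape (..., 3) in [-1, 1] as uint8 RGB."""
    n = np.asarray(normals, dtype=np.float64)
    rgb = np.round((n + 1.0) * 0.5 * 255.0)
    return np.clip(rgb, 0, 255).astype(np.uint8)


def rgb_to_normals(rgb):
    """Decode uint8 RGB back to unit normals, renormalizing after quantization."""
    n = np.asarray(rgb, dtype=np.float64) / 255.0 * 2.0 - 1.0
    length = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.maximum(length, 1e-12)  # guard against zero-length vectors


if __name__ == "__main__":
    normals = np.array([
        [0.0, 0.0, 1.0],            # facing the camera
        [1.0, 0.0, 0.0],            # facing right
        [0.0, -0.6, 0.8],           # tilted
    ])
    encoded = normals_to_rgb(normals)
    decoded = rgb_to_normals(encoded)
    print(encoded)
    print(decoded)  # close to the input, up to 8-bit quantization error
```

Renormalizing on decode matters because 8-bit rounding moves each component slightly, so the decoded vector is no longer exactly unit length.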

References:

https://www.marktechpost.com/2026/04/25/google-deepmind-introduces-vision-banana-an-instruction-tuned-image-generator-that-beats-sam-3-on-segmentation-and-depth-anything-v3-on-metric-depth-estimation/
