According to monitoring by Dongcha Beating, a team from Google (including authors such as Kaiming He and Saining Xie) published a paper proposing Vision Banana. They performed lightweight instruction fine-tuning on their in-house image generation model Nano Banana Pro (also known as Gemini 3 Pro Image) to turn it into a general-purpose visual understanding model. The core idea is to parameterize the output of every visual task as an RGB image, so that perception tasks such as segmentation, depth estimation, and surface normal estimation can all be completed through image generation, with no task-specific architectures or training losses to design for each task.
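To make "parameterize the output as an RGB image" concrete, here is a minimal sketch of one plausible encoding: packing a metric depth map into the 24 bits of an RGB pixel and recovering it afterward. The value range, bit layout, and function names are illustrative assumptions, not the paper's actual scheme.

```python
import numpy as np

def depth_to_rgb(depth, d_min=0.1, d_max=100.0):
    """Encode a metric depth map (meters) as a 24-bit RGB image.

    Hypothetical scheme for illustration: normalize depth to [0, 1],
    quantize to 24 bits, and spread the bits over the R, G, B channels.
    """
    t = np.clip((depth - d_min) / (d_max - d_min), 0.0, 1.0)
    code = np.round(t * (2**24 - 1)).astype(np.uint32)
    r = (code >> 16) & 0xFF          # high byte
    g = (code >> 8) & 0xFF           # middle byte
    b = code & 0xFF                  # low byte
    return np.stack([r, g, b], axis=-1).astype(np.uint8)

def rgb_to_depth(rgb, d_min=0.1, d_max=100.0):
    """Invert depth_to_rgb: recover metric depth from the RGB encoding."""
    rgb = rgb.astype(np.uint32)
    code = (rgb[..., 0] << 16) | (rgb[..., 1] << 8) | rgb[..., 2]
    t = code / (2**24 - 1)
    return d_min + t * (d_max - d_min)
```

With any such invertible encoding, a generative model that outputs images can output task predictions directly, and a fixed decoder turns them back into task-native values.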
The evaluation covered two main categories of tasks: image segmentation and 3D geometric inference. In segmentation, for semantic segmentation (assigning a category to each pixel, such as "road," "pedestrian," or "vehicle"), Vision Banana surpassed the specialized segmentation model SAM 3 by 4.7 percentage points on Cityscapes. In referring expression segmentation (finding and segmenting the object described by a natural language phrase, such as "the dog with a hat on the left"), it also outperformed SAM 3 Agent. In instance segmentation (distinguishing different individuals of the same category, such as labeling five dogs in one image), however, it still lagged behind SAM 3. In the 3D tasks, for metric depth estimation (inferring the actual physical distance from the camera to each pixel from a single image), Vision Banana achieved an average accuracy of 0.929 across four standard datasets, surpassing the specialized model Depth Anything V3's 0.918. Notably, it was trained solely on synthetic data, with no real depth data, and required no camera parameters at inference time. For surface normal estimation (inferring the orientation of object surfaces), it achieved state-of-the-art results on three indoor benchmarks.
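The depth-estimation accuracy figures above (0.929 vs. 0.918) are most likely the standard delta-threshold metric, though the summary does not say so explicitly; a sketch of that metric, under that assumption, is:

```python
import numpy as np

def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels whose predicted/ground-truth depth ratio is
    within `threshold` in either direction (the common delta_1 metric
    when threshold = 1.25). Assumed, not confirmed, to be the metric
    behind scores like 0.929 and 0.918."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < threshold))
```

A per-dataset score like this would then be averaged over the four benchmarks to produce a single number for comparison.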
The fine-tuning process simply mixed a small amount of visual task data into the original image-generation training data, and the model's image generation capability remained largely unaffected: it performed on par with the original Nano Banana Pro in generation quality. The paper argues that image generation pre-training plays a role in the visual domain analogous to text generation pre-training in the language domain: in learning to generate images, the model has already acquired the internal representations needed to understand them, and fine-tuning merely unlocks that capability.
