According to monitoring by Dongcha Beating, a team from Google (including authors such as Kaiming He and Saining Xie) published a paper proposing Vision Banana. They performed lightweight instruction fine-tuning on their in-house image generation model Nano Banana Pro (also known as Gemini 3 Pro Image) to turn it into a general-purpose visual understanding model. The core idea is to parameterize the output of every visual task as an RGB image, so that perception tasks such as segmentation, depth estimation, and surface normal estimation can all be completed through image generation, with no task-specific architectures or training losses to design for each task.
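To make "parameterize the output as an RGB image" concrete, here is a minimal sketch of one plausible encoding: packing a metric depth map into the 24 bits of an RGB pixel and recovering it afterward. The value range, bit layout, and function names are illustrative assumptions, not the paper's actual scheme.

```python
import numpy as np

def depth_to_rgb(depth, d_min=0.1, d_max=100.0):
    """Encode a metric depth map (meters) as a 24-bit RGB image.

    Hypothetical scheme for illustration: normalize depth to [0, 1],
    quantize to 24 bits, and spread the bits over the R, G, B channels.
    """
    t = np.clip((depth - d_min) / (d_max - d_min), 0.0, 1.0)
    code = np.round(t * (2**24 - 1)).astype(np.uint32)
    r = (code >> 16) & 0xFF          # high byte
    g = (code >> 8) & 0xFF           # middle byte
    b = code & 0xFF                  # low byte
    return np.stack([r, g, b], axis=-1).astype(np.uint8)

def rgb_to_depth(rgb, d_min=0.1, d_max=100.0):
    """Invert depth_to_rgb: recover metric depth from the RGB encoding."""
    rgb = rgb.astype(np.uint32)
    code = (rgb[..., 0] << 16) | (rgb[..., 1] << 8) | rgb[..., 2]
    t = code / (2**24 - 1)
    return d_min + t * (d_max - d_min)
```

With any such invertible encoding, a generative model that outputs images can output task predictions directly, and a fixed decoder turns them back into task-native values.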
The evaluation covered two main categories of tasks: image segmentation and 3D geometric inference. In segmentation, for semantic segmentation (assigning a category to each pixel, such as "road," "pedestrian," or "vehicle"), Vision Banana surpassed the specialized segmentation model SAM 3 by 4.7 percentage points on Cityscapes. In referring expression segmentation (finding and segmenting the object described by a natural language phrase, such as "the dog with a hat on the left"), it also outperformed SAM 3 Agent. In instance segmentation (distinguishing different individuals of the same category, such as labeling five dogs in one image), however, it still lagged behind SAM 3. In the 3D tasks, for metric depth estimation (inferring the actual physical distance from the camera to each pixel from a single image), Vision Banana achieved an average accuracy of 0.929 across four standard datasets, surpassing the specialized model Depth Anything V3's 0.918. Notably, it was trained solely on synthetic data, with no real depth data, and required no camera parameters at inference time. For surface normal estimation (inferring the orientation of object surfaces), it achieved state-of-the-art results on three indoor benchmarks.
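The depth-estimation accuracy figures above (0.929 vs. 0.918) are most likely the standard delta-threshold metric, though the summary does not say so explicitly; a sketch of that metric, under that assumption, is:

```python
import numpy as np

def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels whose predicted/ground-truth depth ratio is
    within `threshold` in either direction (the common delta_1 metric
    when threshold = 1.25). Assumed, not confirmed, to be the metric
    behind scores like 0.929 and 0.918."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < threshold))
```

A per-dataset score like this would then be averaged over the four benchmarks to produce a single number for comparison.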
The fine-tuning process simply mixed a small amount of visual task data into the original image-generation training data, and the model's image generation capability remained largely unaffected: it performed on par with the original Nano Banana Pro in generation quality. The paper argues that image generation pre-training plays a role in the visual domain analogous to text generation pre-training in the language domain: in learning to generate images, the model has already acquired the internal representations needed to understand them, and fine-tuning merely unlocks that capability.
