According to monitoring by Perceive Beating, DeepSeek's web and app versions have officially launched Vision Mode, which now appears alongside Quick Mode and Expert Mode above the chat input box. The new visual understanding capability goes beyond text recognition (OCR): it focuses on deep scene analysis, spatial and logical reasoning, and direct conversion of UI screenshots into structured HTML code. For difficult geometric deductions or complex chart analysis, the system automatically invokes the deep thinking model to provide a complete chain of reasoning.
Vision Mode is built on the "Thinking with Visual Primitives" research framework disclosed by the DeepSeek team. In a paper co-published with Peking University and Tsinghua University, multimodal researcher Xiaokang Chen highlighted the "Reference Gap" that existing vision-language models show in fine-grained localization and spatial reasoning: vague natural language struggles to describe complex visual coordinates. To address this, the research team made coordinate points and bounding boxes the smallest units of thought and inserted these spatial primitives directly into the Chain of Thought (CoT) of visual reasoning models, so that spatial references are made in step with the reasoning itself.
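To make the idea of spatial primitives inside a reasoning chain more concrete, here is a minimal Python sketch. It is only an illustration: the `<point>` and `<box>` tag syntax, the `Primitive` class, and the `extract_primitives` helper are all hypothetical, since the report does not describe the format DeepSeek actually uses.

```python
import re
from dataclasses import dataclass

# Hypothetical inline syntax for spatial primitives in a reasoning trace.
# The source does not specify a format; these tags are an assumption.
PRIMITIVE = re.compile(r"<(point|box)>([^<]*)</\1>")


@dataclass(frozen=True)
class Primitive:
    kind: str                  # "point" (x, y) or "box" (x1, y1, x2, y2)
    coords: tuple[float, ...]  # pixel coordinates, as written in the trace


def extract_primitives(trace: str) -> list[Primitive]:
    """Collect every point or box reference embedded in a reasoning trace."""
    found = []
    for match in PRIMITIVE.finditer(trace):
        kind = match.group(1)
        coords = tuple(float(v) for v in match.group(2).split(","))
        expected = 2 if kind == "point" else 4
        if len(coords) != expected:
            raise ValueError(f"{kind} needs {expected} values, got {len(coords)}")
        found.append(Primitive(kind, coords))
    return found


# An invented reasoning step that refers to regions by coordinates
# instead of by vague phrases such as "the cup on the left".
trace = (
    "The red cup <box>120,80,200,190</box> sits left of the laptop "
    "<box>260,60,520,240</box>; its handle is at <point>198,130</point>."
)
primitives = extract_primitives(trace)
for p in primitives:
    print(p.kind, p.coords)
```

In this sketch each reasoning step carries its own explicit coordinates, so a later step or an external tool can refer to exactly the same region without relying on a natural-language description.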
The academic paper and open-source project underlying the visual capability were briefly released on April 30 but were withdrawn by DeepSeek without prior notice on May 1, prompting industry speculation about excessive disclosure of technical details and about further optimization of the model. The officially launched Vision Mode currently supports only image input; it does not support video, audio, or other modalities, and the model cannot yet generate images.
