
All Details About the Most Powerful Open Source Model DeepSeek V4: Performance on Par with Opus 4.6, Price Reduction, Encoding Benchmark Topping

DeepSeek bills V4-Pro-Max (its maximum-reasoning-effort mode) as the strongest open-source model to date: it tops the coding benchmarks, and its gap with closed-source frontier models on reasoning and agent tasks has narrowed markedly.

Today, DeepSeek announced the open-source release of the V4 series preview, with weights available on Hugging Face and ModelScope under the MIT license. The series includes two MoE models: V4-Pro (1.6T total parameters, 49B activated per token) and V4-Flash (284B total parameters, 13B activated per token), both supporting a 1M-token context.


There are three key architecture upgrades:


· A hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavy Compression Attention (HCA), sharply reducing long-context overhead. At 1M context, V4-Pro's per-token inference FLOPs are only 27% of V3.2's, and its KV cache footprint is only 10% of V3.2's.


· Manifold-Constrained Hyper-Connections (mHC) replace traditional residual connections, improving the stability of cross-layer signal propagation.


· Training switched to the Muon optimizer for faster convergence. Total pretraining data exceeds 32T tokens.
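To get a rough sense of what a 10% KV-cache footprint means at 1M context, some back-of-envelope arithmetic helps. The layer count, KV-head count, and head dimension below are illustrative assumptions for a GQA-style dense baseline, not published V4 numbers, and the dense estimate merely stands in for V3.2's cache:

```python
# Back-of-envelope KV-cache arithmetic for a 1M-token context. All model
# dimensions here are illustrative assumptions, not published V4 figures.

def kv_cache_bytes(ctx_len, n_layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes for a dense K+V cache (the factor 2 covers keys plus values)."""
    return 2 * ctx_len * n_layers * kv_heads * head_dim * bytes_per_elem

dense = kv_cache_bytes(ctx_len=1_000_000, n_layers=61, kv_heads=8, head_dim=128)
compressed = dense * 0.10   # scaled by the 10%-of-baseline ratio cited above
print(f"dense baseline: {dense / 2**30:.0f} GiB, compressed: {compressed / 2**30:.0f} GiB")
```

Even under these modest assumptions, the dense cache runs to hundreds of GiB at 1M tokens, which is why a 10x cache reduction is the difference between feasible and infeasible serving.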

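As rough intuition for the hyper-connection idea: the single residual stream is widened into n parallel streams that are mixed each layer by small learnable matrices, with a plain residual as the n=1 special case. The specific manifold constraint mHC imposes is not detailed here, so the sketch below is an assumption-laden illustration, not DeepSeek's architecture:

```python
import numpy as np

# Illustration of a hyper-connection-style block: n parallel residual
# streams, mixed per layer, with the block output spread across streams.
# The exact manifold constraint used by mHC is assumed, not known.

def hyper_connect(streams, layer_out, mix, spread):
    """streams: (n, d) parallel residuals; layer_out: (d,) block output;
    mix: (n, n) cross-stream mixing; spread: (n,) share of the block
    output routed to each stream. A plain residual connection is the
    special case n=1, mix=[[1.]], spread=[1.]."""
    return mix @ streams + np.outer(spread, layer_out)

# With identity mixing, this reduces to an ordinary residual add on
# whichever stream(s) receive the block output.
out = hyper_connect(np.zeros((4, 8)), np.ones(8), np.eye(4),
                    np.array([1.0, 0.0, 0.0, 0.0]))
```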

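Muon's core step is documented in its public reference implementation: for each 2-D weight matrix, the momentum is approximately orthogonalized with a quintic Newton-Schulz iteration before being applied. The coefficients below come from that reference implementation; the rest is a minimal NumPy sketch, not DeepSeek's training code:

```python
import numpy as np

# Minimal Muon-style update sketch: momentum accumulation followed by
# approximate orthogonalization of the momentum matrix.

def newton_schulz(G, steps=5, eps=1e-7):
    """Push G's singular values toward 1 (approximate orthogonalization)."""
    a, b, c = 3.4445, -4.7750, 2.0315        # quintic iteration coefficients
    X = G / (np.linalg.norm(G) + eps)        # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                           # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    """One momentum-then-orthogonalize update, applied in place."""
    momentum *= beta
    momentum += grad
    param -= lr * newton_schulz(momentum)
```

The orthogonalized step equalizes update magnitude across singular directions of the weight matrix, which is the property usually credited for Muon's faster convergence versus elementwise optimizers.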
Post-training proceeds in two stages: domain experts are first trained separately with SFT and GRPO reinforcement learning, then unified into a single model through online distillation.
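GRPO's defining trait is that it needs no learned value network: several completions are sampled per prompt, and each reward is normalized against its own group's statistics. A minimal sketch of that advantage step (not DeepSeek's implementation):

```python
from statistics import mean, pstdev

# Group-relative advantage computation, the core of GRPO: rewards for one
# prompt's sampled completions are standardized within the group.

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages for one prompt's sampled completions."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Two of four rollouts pass a verifier: passers get positive advantage,
# failers negative, symmetric around the group mean.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

The normalized advantages then weight the policy-gradient loss for each completion's tokens, so the model is pushed toward whatever beat its own group average.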


Performance Evaluation: V4-Pro-Max Claims to Be the Most Powerful Current Open-Source Model


V4-Pro's highest reasoning-effort mode is called V4-Pro-Max. The official technical report compares it with Opus 4.6 Max, GPT-5.4 xHigh, Gemini 3.1 Pro High, and the open-source models Kimi K2.6 and GLM-5.1 (excluding the newly released Opus 4.7 and GPT-5.5; the final gap awaits third-party validation).


On the coding side, V4-Pro-Max scores 3206 on Codeforces, surpassing GPT-5.4's 3168 and Gemini 3.1 Pro's 3052 and setting a new record for the benchmark. It scores 93.5 on LiveCodeBench, also the highest across the board. On SWE Verified it scores 80.6, just 0.2 percentage points below Opus 4.6's 80.8.


On long context, V4-Pro-Max ranks second on both 1M benchmarks: 62.0 on CorpusQA 1M (Opus 4.6: 71.7) and 83.5 on MRCR 1M (Opus 4.6: 92.9).


On agent tasks, it scores 73.6 on MCPAtlas Public, just below Opus 4.6's 73.8; on Terminal-Bench 2.0 it scores 67.9, below GPT-5.4's 75.1 and Gemini 3.1 Pro's 68.5.


A significant gap remains in knowledge and reasoning: GPQA Diamond 90.1 (Gemini: 94.3), SimpleQA-Verified 57.9 (Gemini: 75.6), HLE 37.7 (Gemini: 44.4).


In short, V4-Pro-Max is the first open-source model to match or even surpass some closed-source flagships across coding and long-context benchmarks, though it still lags Gemini 3.1 Pro on knowledge-intensive evaluations.


Internal Dogfooding Data and Mathematical Reasoning


DeepSeek rarely discloses internal dogfooding data. This time the team collected around 200 real R&D tasks from over 50 engineers, covering feature development, bug fixes, refactoring, and diagnostics across a stack that includes PyTorch, CUDA, Rust, and C++; after strict filtering, 30 tasks were retained as the evaluation set.


V4-Pro-Max achieves a 67% pass rate, well above Sonnet 4.5's 47% and close to Opus 4.5's 70%, but below Opus 4.5 Thinking's 73% and Opus 4.6 Thinking's 80%; Haiku 4.5 passes only 13%. An internal survey (N=85) found that all respondents use V4-Pro for agentic coding in their daily work, with 52% treating V4-Pro as their default primary coding model, 39% leaning positive, and fewer than 9% negative. The main complaints were low-level errors, misreading of vague prompts, and occasional overthinking.


Turning to formal mathematical reasoning: the Putnam Competition is the most prestigious undergraduate mathematics competition in North America. In the practical regime, V4-Flash-Max scored 81.00 on Putnam-200 Pass@8 using the open-source tool LeanExplore and restricted sampling; by comparison, Seed-2.0-Prover scored 35.50, and Gemini 3 Pro and Seed-1.5-Prover each scored 26.50.


In the frontier regime, V4 adopts a hybrid formal-informal approach: informal reasoning first generates candidate natural-language solutions, which are filtered by self-validation and then rigorously proven by a formal agent in Lean. On Putnam-2025, V4 scored a perfect 120/120, tying Axiom for first place and surpassing Seed-1.5-Prover's 110/120 and Aristotle's 100/120. The frontier regime used extensive computational scaling; the practical-regime results better reflect routine deployment capability.
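The draft-validate-formalize loop described above can be sketched as a small orchestration function. The three stage callables are hypothetical stand-ins for the model's informal reasoner, its self-checker, and its Lean proof agent:

```python
# Orchestration sketch of the hybrid formal-informal loop: draft informal
# solutions, keep those passing self-validation, then try to formalize the
# survivors. All three callables are hypothetical stand-ins.

def hybrid_prove(problem, draft, validate, formalize, n_candidates=8):
    candidates = [draft(problem) for _ in range(n_candidates)]
    survivors = [c for c in candidates if validate(problem, c)]
    for c in survivors:
        proof = formalize(problem, c)   # returns a checked proof or None
        if proof is not None:
            return proof
    return None
```

The self-validation filter is what makes this economical: expensive formal proving is only attempted on candidates the model itself already believes are correct.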


API and Pricing: V4-Flash Price Reduction with Context Upgrade, V4-Pro Positioned as High-End Tier


The DeepSeek V4 API now serves both V4-Pro and V4-Flash. Per the official pricing and capacity plan, V4-Flash directly replaces V3.2 (deepseek-chat) at lower prices: cache-hit input is unchanged at 0.2 yuan per million tokens, cache-miss input drops from 2 yuan to 1 yuan (a 50% cut), and output drops from 3 yuan to 2 yuan (a 33% cut). Context expands from 128K to 1M tokens: 8x the context at a lower price. The legacy model names deepseek-chat and deepseek-reasoner now point to V4-Flash's non-reasoning and reasoning modes respectively, and will be deprecated on July 24, 2026.
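Using the per-million-token prices quoted above, a small calculator makes the tier gap concrete:

```python
# Prices in yuan per million tokens, as quoted in the pricing announcement.
V32   = {"hit": 0.2, "miss": 2.0,  "out": 3.0}
FLASH = {"hit": 0.2, "miss": 1.0,  "out": 2.0}
PRO   = {"hit": 1.0, "miss": 12.0, "out": 24.0}

def cost(price, hit_m=0.0, miss_m=1.0, out_m=1.0):
    """Cost in yuan for the given token volumes (in millions of tokens)."""
    return price["hit"] * hit_m + price["miss"] * miss_m + price["out"] * out_m

# 1M cache-miss input tokens plus 1M output tokens per tier:
print(cost(V32), cost(FLASH), cost(PRO))   # 5.0 3.0 36.0
```

For this workload, Flash costs 40% less than V3.2 did, while Pro costs 12x what Flash does, consistent with its positioning as a constrained high-end tier.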


V4-Pro sits in a new high-end tier: 1 yuan per million tokens for cache-hit input, 12 yuan for cache-miss input, and 24 yuan for output, roughly 8x the price of V3.2. In the pricing-table notes, DeepSeek explains that high-end compute is limited, so current Pro serving throughput is constrained; Pro prices are expected to drop significantly once the Ascend 950 supernodes come online in the second half of the year. Both models support non-reasoning and reasoning modes, with reasoning mode offering two effort levels via the reasoning_effort parameter: high and max.
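Assuming DeepSeek keeps its OpenAI-compatible Chat Completions interface, selecting the effort level might look like the sketch below. Only the model alias and the high/max values come from the announcement; the request shape is otherwise assumed, and nothing is sent anywhere:

```python
# Hypothetical request-body builder for an OpenAI-compatible endpoint.
# "deepseek-reasoner" now aliases V4-Flash's reasoning mode per the
# announcement; the rest of this shape is an assumption.

def build_request(prompt, model="deepseek-reasoner", effort="high"):
    if effort not in ("high", "max"):
        raise ValueError("reasoning_effort is documented as 'high' or 'max'")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,
    }

request = build_request("Prove that sqrt(2) is irrational.", effort="max")
```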


DeepSeek stated in the announcement: "Starting now, 1M context will be a standard feature of all DeepSeek official services."


Inaugural Infrastructure Release: Production-Grade Elastic Compute Sandbox DSec


The DeepSeek V4 technical report unveils for the first time the core infrastructure behind its agentic post-training and massive-scale evaluation: the production-grade elastic compute sandbox DSec (DeepSeek Elastic Compute).


Reinforcement learning at this scale requires an enormous trial-and-error environment for code. The report reveals that in production, a single DSec cluster can concurrently schedule hundreds of thousands of sandbox instances. The system, written in Rust, interfaces with the in-house 3FS distributed file system and breaks the cold-start bottleneck of massive sandbox fleets through hierarchical on-demand loading.


On the developer-experience side, DSec unifies function calls, containers, micro-VMs, and full VMs behind a single Python SDK; switching runtimes requires changing only one parameter. To handle the task preemption common on compute clusters, DSec introduces global trace logs: when a task resumes, the system "fast-forwards" by replaying cached command results, achieving fast checkpoint resumption while avoiding the non-idempotency errors that re-execution would cause.
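The fast-forward replay idea can be sketched as a trace log that returns cached results on resume instead of re-executing commands. This is an illustration of the mechanism, not the DSec API:

```python
# Trace-log "fast-forward" sketch: every command's result is recorded;
# after preemption, a resumed task replays cached results rather than
# re-executing (possibly non-idempotent) commands. Hypothetical design.

class TraceLog:
    def __init__(self):
        self.log = []      # ordered (command, result) pairs
        self.cursor = 0    # replay position for a resumed task

    def run(self, command, execute):
        if self.cursor < len(self.log):            # resuming: fast-forward
            logged_cmd, logged_result = self.log[self.cursor]
            assert logged_cmd == command, "trace diverged from original run"
            self.cursor += 1
            return logged_result
        result = execute(command)                  # first execution: record it
        self.log.append((command, result))
        self.cursor += 1
        return result

    def resume(self):
        """Restart replay from the beginning, as after a preemption."""
        self.cursor = 0
```

Because replay returns the recorded result without calling `execute`, commands with side effects (package installs, file writes) are not run twice, which is exactly the non-idempotency hazard the trace log exists to avoid.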


V4 Responds to "Adaptation Difficulty" Speculation with Data


Before DeepSeek V4's release, the community widely speculated that the delay stemmed from difficulties adapting the model from NVIDIA to the Huawei Ascend platform. The V4 technical report does not address the rumor directly, but the performance data it discloses plainly contradicts it.


The report shows that V4's fine-grained EP scheme has been deployed and validated on both NVIDIA GPUs and Huawei Ascend NPUs, speeding up regular inference workloads by 1.50-1.73x, with latency-sensitive scenarios such as RL rollout and high-speed agent serving reaching up to 1.96x. The team has open-sourced the CUDA kernel version of MegaMoE as part of DeepGEMM. In other words, V4 achieves near-theoretical efficiency on both hardware platforms, and cross-platform adaptation has caused no performance degradation.


