According to Dochat Beating's monitoring, NVIDIA has published a blog post dissecting inference hardware selection. Its core argument fits in one sentence: inference infrastructure should be evaluated on "cost per token," not "cost per GPU per hour." Compared on unit price, Blackwell is the more expensive GPU; compared on token cost, it handily beats the previous generation.
The blog uses DeepSeek-R1 (an MoE reasoning model) as the test subject, pitting Blackwell (GB300 NVL72) against the previous-generation Hopper (HGX H200). At cloud-market leasing rates, Blackwell runs $2.65 per GPU per hour, nearly twice Hopper's $1.41. But per-GPU token output jumps from 90 to 6,000 tokens per second, roughly a 65x throughput gain. Amortized, the cost per million tokens falls from $4.20 to $0.12, and token output per megawatt rises by 50x.
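The per-token figure follows directly from the hourly rate and sustained throughput. Below is a minimal sketch (plain Python, not from the blog) that reproduces the comparison using the article's quoted numbers; the small gap on the Hopper side versus the quoted $4.20 presumably reflects rounding in the source.

```python
def cost_per_million_tokens(usd_per_gpu_hour: float, tokens_per_second: float) -> float:
    """Amortized dollars per 1M tokens for one GPU at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return usd_per_gpu_hour / (tokens_per_hour / 1e6)

# Figures quoted in the article (per GPU):
print(cost_per_million_tokens(1.41, 90))    # ~4.35, vs. the quoted $4.20
print(cost_per_million_tokens(2.65, 6000))  # ~0.12, matching the quoted figure
```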
One caveat: the $0.12 figure assumes FP4 low-precision inference plus MTP (Multi-Token Prediction, which lets the model emit multiple tokens per decoding step to speed up generation). Per SemiAnalysis InferenceX v2 raw data, running DeepSeek-R1 on the same GB300 NVL72 without MTP costs roughly $2.35 per million tokens; with MTP enabled it drops to about $0.11, a roughly 21x gap from this single optimization alone. All of the above results come from tests of the single DeepSeek-R1 model; the numbers may differ for other model architectures and scales.
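With hardware, model, and hourly rate held fixed, the MTP gap reduces to dividing the two quoted costs; a one-line check of the cited SemiAnalysis figures:

```python
# $/M tokens on GB300 NVL72 running DeepSeek-R1 (SemiAnalysis InferenceX v2 figures as cited):
cost_without_mtp = 2.35
cost_with_mtp = 0.11
print(f"{cost_without_mtp / cost_with_mtp:.1f}x")  # ~21.4x, the "21x" gap in the text
```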
