According to monitoring by 动察 Beating, Google has released DiffusionGemma, an experimental open-source large model that generates text through diffusion rather than token by token, as traditional large language models do. DiffusionGemma has 26 billion parameters in total and uses a mixture-of-experts (MoE) architecture in which only 3.8 billion parameters are activated in each forward pass. By generating entire blocks of text in parallel, it achieves up to a 4x speedup in local GPU inference.
Unlike traditional "typewriter-style" token-by-token generation, DiffusionGemma works much like image generation: it first fills a canvas with random placeholders, then iteratively removes noise and locks in the correct text over multiple time steps. Each forward pass generates a block of 256 tokens in parallel, and bidirectional attention lets every token in the block interact with every other token. This bidirectional attention gives it a significant advantage in non-linear generation tasks such as code completion, in-line editing, and mathematical formula generation. However, DiffusionGemma's overall output quality is currently lower than that of the standard Gemma 4.
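The canvas-and-lock-in loop described above can be illustrated with a toy sketch. This is not DiffusionGemma's actual algorithm, which the report does not specify; it assumes a masked-diffusion style decoder in which each step proposes tokens for all open positions at once and keeps only the most confident ones. The names `toy_denoiser` and `diffusion_decode`, the even unmasking schedule, and the step count are all hypothetical.

```python
import random

MASK = object()  # sentinel for a canvas position that has not been decided yet


def toy_denoiser(canvas, target, rng):
    """Hypothetical stand-in for the model.

    Returns {position: (token, confidence)} for every masked position in one
    call, mimicking a single parallel forward pass. A real model would predict
    tokens using bidirectional attention over the whole canvas; this stub just
    reads a known target and attaches a random confidence.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(canvas)
        if tok is MASK
    }


def diffusion_decode(target, steps=8, seed=0):
    """Fill a fully masked canvas over `steps` parallel refinement passes."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    history = []  # number of locked-in tokens after each step
    for step in range(steps):
        proposals = toy_denoiser(canvas, target, rng)
        if not proposals:
            break
        # Lock in an even share of what is left, most confident first.
        remaining_steps = steps - step
        k = -(-len(proposals) // remaining_steps)  # ceiling division
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            canvas[i] = proposals[i][0]
        history.append(sum(tok is not MASK for tok in canvas))
    return canvas, history


if __name__ == "__main__":
    block = list(range(256))  # a 256-token block, as in the article
    canvas, history = diffusion_decode(block, steps=8)
    print(history)
```

With a 256-token block and 8 steps, the sketch needs 8 model calls instead of the 256 that strictly sequential decoding would require, which is the intuition behind the reported speedup. Positions are filled in confidence order rather than left to right, which is why this style of decoding suits infilling and in-line editing.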
In hardware tests, a single NVIDIA H100 GPU reaches a generation speed of over 1,000 tokens per second, while a consumer-grade NVIDIA GeForce RTX 5090 exceeds 700 tokens per second. With 4-bit floating-point (NVFP4) quantization, inference VRAM usage can be kept within 18 GB, significantly lowering the barrier to local deployment.
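A rough back-of-the-envelope check shows why an 18 GB figure is plausible for a 26-billion-parameter model. The sketch below counts weight storage only, and assumes exactly 4 bits per weight and decimal gigabytes; it ignores quantization scale factors, activations, and any caches, which presumably account for the remaining headroom. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9  # total parameter count reported in the article

# All experts of an MoE model normally have to be resident in memory,
# so the total count matters here, not the 3.8B activated per forward pass.
fp16_gb = weight_memory_gb(TOTAL_PARAMS, 16)  # 16-bit baseline
fp4_gb = weight_memory_gb(TOTAL_PARAMS, 4)    # idealized 4-bit weights

if __name__ == "__main__":
    print(f"16-bit weights: {fp16_gb:.1f} GB")
    print(f"4-bit weights:  {fp4_gb:.1f} GB")
```

Under these assumptions the weights shrink from about 52 GB at 16 bits to about 13 GB at 4 bits, which fits within the reported 18 GB and within the memory of a single high-end consumer GPU.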
DiffusionGemma's weights have been open-sourced on Hugging Face and are supported by mainstream development tools including MLX, vLLM, Unsloth, and NVIDIA NeMo.
