Blazing fast inference
By shifting the decode bottleneck from memory bandwidth to raw compute, DiffusionGemma generates tokens up to 4-5x faster on NVIDIA GPUs, reaching over 1,000 tokens per second on a single H100.
Built on Gemma 4 and Gemini Diffusion research, it prioritizes unprecedented speed and parallel layout generation, unlocking novel workflows for developers building real-time interactive AI applications.
DiffusionGemma is a Mixture of Experts (MoE) model with 26B total parameters, of which only 3.8B are active during inference. When quantized, it fits comfortably within 24GB of VRAM on a consumer GPU such as the NVIDIA RTX 4090 or 5090.
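To see why quantization matters for the 24GB figure, here is a rough back-of-the-envelope sketch. It counts weight storage only and assumes every expert stays resident in VRAM; the KV cache, activations, and any per-block scaling metadata are ignored, so real usage will be higher. The helper name `weight_gb` is illustrative, not part of any library.

```python
# Rough weight-memory estimate for a 26B-parameter MoE model.
# Assumption: all experts are held in VRAM, so the total parameter
# count (not the 3.8B active count) determines the footprint.
TOTAL_PARAMS = 26e9
VRAM_GB = 24


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Return weight storage in GB (1 GB = 1e9 bytes) at a given precision."""
    return params * bits_per_weight / 8 / 1e9


for bits in (16, 8, 4):
    gb = weight_gb(TOTAL_PARAMS, bits)
    print(f"{bits:>2}-bit: {gb:5.1f} GB  fits in {VRAM_GB} GB: {gb < VRAM_GB}")
```

Under these assumptions the weights need about 52 GB at 16-bit, 26 GB at 8-bit, and 13 GB at 4-bit, so only the 4-bit case leaves headroom on a 24GB card.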
Generating 256 tokens in parallel with each forward pass allows every token to attend to all others. This provides significant advantages for non-linear domains such as in-line editing and code infilling.
The model iteratively refines its own output, evaluating the entire text block at once so it can close complex formatting correctly and fix mistakes as it generates.
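The idea of parallel, iterative refinement can be illustrated with a toy sketch. This is not DiffusionGemma's actual sampler: the confidence-ordered unmasking schedule, the `refine_block` function, and the `toy_predict` stand-in (which simply looks up a known answer and is more confident when neighbours on either side are already filled) are all assumptions made for illustration. The sketch only fills masked positions; it does not show revising tokens that were already committed.

```python
from typing import Callable, List, Tuple

MASK = "<mask>"
Prediction = Tuple[str, float]  # (token, confidence)


def refine_block(block: List[str],
                 predict: Callable[[List[str]], List[Prediction]],
                 steps: int) -> List[str]:
    """Fill masked positions over several passes, most confident first."""
    block = list(block)
    for step in range(steps):
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        if not masked:
            break
        # One "forward pass": every position is scored at the same time.
        preds = predict(block)
        # Commit an even share of the remaining masks on each pass.
        quota = -(-len(masked) // (steps - step))  # ceiling division
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:quota]
        for i in best:
            block[i] = preds[i][0]
    return block


# Hypothetical stand-in for a model, used only to make the loop runnable.
TARGET = "def add ( a , b ) : return a + b".split()


def toy_predict(block: List[str]) -> List[Prediction]:
    out = []
    for i in range(len(block)):
        neighbours = [block[j] for j in (i - 1, i + 1) if 0 <= j < len(block)]
        known = sum(tok != MASK for tok in neighbours)
        # Confidence grows with known context on BOTH sides of the gap.
        out.append((TARGET[i], known / 2))
    return out


# Code infilling: the arguments and the return expression are missing.
prompt = ["def", "add", "(", MASK, MASK, MASK, ")", ":",
          "return", MASK, MASK, MASK]
result = refine_block(prompt, toy_predict, steps=3)
print(" ".join(result))
```

Because each pass sees the whole block, positions next to existing text on either side get filled first and the gaps close inward, which is the property that makes this style of decoding a natural fit for infilling and in-line edits.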
Native support for NVIDIA's new NVFP4 (4-bit floating-point) format on Blackwell GPUs dramatically accelerates compute throughput, allowing the model to run at faster speeds with near-lossless accuracy.