Google DeepMind has released DiffusionGemma, an experimental open-source text generation model optimised for Nvidia’s GeForce RTX GPUs, the Nvidia RTX PRO platform and Nvidia DGX Spark systems.
This adaptation aims to support text generation tasks with significantly reduced latency across a range of local hardware configurations, from personal computers to cloud environments.
DiffusionGemma differs from conventional large language models, which usually generate text sequentially, producing one token at a time conditioned on the tokens that precede it. In contrast, the new model can generate up to 256 tokens in parallel during each step, creating entire blocks of text at once.
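The difference in step count can be sketched in a few lines of Python. This is a rough illustration rather than DeepMind's decoder: the 256-token block size comes from the figure quoted above, while the function names and the `refine_steps` parameter (how many passes each block receives) are assumptions made for the example.

```python
import math

BLOCK_SIZE = 256  # tokens generated in parallel per step, as reported


def autoregressive_steps(num_tokens: int) -> int:
    """A sequential decoder needs one forward pass per generated token."""
    return num_tokens


def block_parallel_steps(num_tokens: int, refine_steps: int = 1) -> int:
    """Forward passes for block-wise generation.

    refine_steps is a hypothetical knob: the number of passes spent on
    each block. The article does not state the real value.
    """
    blocks = math.ceil(num_tokens / BLOCK_SIZE)
    return blocks * refine_steps


if __name__ == "__main__":
    n = 1024
    print("autoregressive passes:", autoregressive_steps(n))
    print("block-parallel passes:", block_parallel_steps(n, refine_steps=8))
```

Even with several passes per block, the number of forward passes can be far smaller than the token count, which is where the latency benefit for a single user would come from.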
This parallel approach is positioned to benefit developers, researchers and AI practitioners who run single-user workloads, such as interactive chat applications and on-device assistants, by offering faster response times.
The model is built upon Gemma 4, a 26-billion-parameter mixture-of-experts (MoE) architecture, in which only 3.8 billion parameters are activated per inference step. This configuration enables the model to fit within the memory constraints of high-end consumer GPUs, reportedly operating within 18GB of VRAM when quantised.
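A back-of-the-envelope calculation shows why quantisation matters here. The sketch below counts weight storage only and assumes that all 26 billion parameters must be resident in memory even though only 3.8 billion are active per step, as is typical for MoE models. The bit widths are illustrative assumptions; the article does not say which quantisation scheme produces the 18GB figure.

```python
TOTAL_PARAMS = 26e9    # total parameters, from the article
ACTIVE_PARAMS = 3.8e9  # parameters activated per inference step


def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes).

    Ignores activations, caches and runtime overhead.
    """
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for bits in (16, 8, 4):  # assumed precisions, for illustration only
        total = weight_footprint_gb(TOTAL_PARAMS, bits)
        active = weight_footprint_gb(ACTIVE_PARAMS, bits)
        print(f"{bits:>2}-bit: all weights {total:.1f} GB, "
              f"active per step {active:.1f} GB")
```

At 16 bits the weights alone would need about 52 GB, beyond any consumer GPU, whereas at 4 bits they come to about 13 GB, which leaves room for runtime overhead inside the reported 18GB.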
Nvidia has tailored DiffusionGemma to capitalise on its hardware strengths, citing compatibility with Nvidia Tensor Cores and the CUDA software environment.
As a result, the model achieves measurable speed gains: official figures indicate throughput of 1,000 tokens per second on a single Nvidia H100 Tensor Core GPU, 150 tokens per second on Nvidia DGX Spark and up to 2,000 tokens per second on Nvidia DGX Station.
The companies state that these speeds are roughly four times those of comparable autoregressive models under single-user conditions.
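In terms of waiting time, the quoted throughput figures translate as follows. The sketch assumes a 500-token reply (an arbitrary example length) and derives the autoregressive baseline by dividing each quoted figure by four, following the "four times" claim. Those baseline numbers are inferred for illustration and are not published measurements.

```python
# Quoted throughput in tokens per second, from the article.
QUOTED_TPS = {
    "H100": 1000,
    "DGX Spark": 150,
    "DGX Station": 2000,
}
CLAIMED_SPEEDUP = 4  # "approximately four times", per the companies


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to produce num_tokens at a steady throughput."""
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    reply_tokens = 500  # hypothetical reply length
    for system, tps in QUOTED_TPS.items():
        fast = generation_seconds(reply_tokens, tps)
        # Inferred baseline: quoted throughput divided by the claimed speed-up.
        slow = generation_seconds(reply_tokens, tps / CLAIMED_SPEEDUP)
        print(f"{system}: {fast:.2f} s vs about {slow:.2f} s autoregressive")
```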
DiffusionGemma uses bi-directional attention, which enables each token generated in a block to reference every other token within that same block. This approach may offer benefits in tasks that require non-linear outputs, such as code infilling or working with mathematical and amino acid sequences.
The architecture also incorporates an iterative self-correction mechanism, refining output across the entire block at each step.
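The contrast between the two attention patterns can be shown with boolean masks, where entry `[i][j]` says whether position `i` may attend to position `j`. This is a simplified sketch covering a single block with no prompt context; the function names are made up for the example and do not reflect DiffusionGemma's actual implementation.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """Autoregressive pattern: position i sees only positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_block_mask(n: int) -> list[list[bool]]:
    """Block pattern: every position sees every other position in the block."""
    return [[True] * n for _ in range(n)]


def show(mask: list[list[bool]]) -> str:
    """Render a mask as rows of 1s (visible) and 0s (hidden)."""
    return "\n".join(" ".join("1" if v else "0" for v in row) for row in mask)


if __name__ == "__main__":
    print("causal:")
    print(show(causal_mask(4)))
    print("bidirectional block:")
    print(show(bidirectional_block_mask(4)))
```

In the causal mask the first position cannot see anything that follows it, so text can only be filled in left to right. In the block mask it can, which is what would let a model fill a gap using context on both sides, as in code infilling.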
Google DeepMind notes that DiffusionGemma is published under an Apache 2.0 licence and is supported from launch by frameworks such as Hugging Face Transformers, vLLM and Unsloth.
However, the model remains experimental and is recommended for applications prioritising speed and iterative interaction rather than maximum text quality, for which standard Gemma 4 remains preferred.
For high-throughput, cloud-based workloads, the company notes that traditional autoregressive models may retain efficiency advantages.
