Reduced memory usage
Lower memory requirements allow longer samples to be generated on memory-constrained devices, such as a single GPU or a CPU.
Explore RecurrentGemma
RecurrentGemma is based on Griffin, a hybrid model architecture that mixes gated linear recurrences with local sliding window attention.
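To make the two ingredients concrete, here is a minimal sketch in plain Python. It is a simplified illustration, not Griffin's actual layers: the function names `gated_linear_recurrence` and `sliding_window_mask` are hypothetical, the recurrence runs on scalars, and the gates are supplied by hand rather than learned.

```python
def gated_linear_recurrence(xs, gates):
    """Simplified gated linear recurrence (illustrative, not Griffin's exact layer).

    Computes h_t = a_t * h_{t-1} + (1 - a_t) * x_t with each gate a_t in (0, 1).
    The state h is a single fixed-size value no matter how long the input is.
    """
    h = 0.0
    outputs = []
    for x, a in zip(xs, gates):
        h = a * h + (1.0 - a) * x
        outputs.append(h)
    return outputs


def sliding_window_mask(seq_len, window):
    """Causal local-attention mask (illustrative).

    mask[i][j] is True when position i may attend to position j, i.e. when
    j is one of the `window` most recent positions up to and including i.
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    print(gated_linear_recurrence([1.0, 1.0, 0.0], [0.5, 0.5, 0.5]))
    for row in sliding_window_mask(seq_len=5, window=2):
        print(["x" if allowed else "." for allowed in row])
```

The recurrence carries information forward through a state that does not grow with the sequence, while the mask restricts each position to a bounded number of recent positions.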
RecurrentGemma performs inference at significantly higher batch sizes and can generate substantially more tokens per second, especially for long sequences.
RecurrentGemma matches Gemma's performance while requiring less memory and achieving faster inference.
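One way to see where the memory saving could come from is to count how many past positions must be kept in an attention cache while decoding. The sketch below is a rough illustration under stated assumptions: `cached_positions` is a hypothetical helper, the window of 2048 is an example value, and the count covers positions only, not bytes or the separate fixed-size recurrent state.

```python
def cached_positions(seq_len, window=None):
    """Past positions whose keys/values are kept while decoding (illustrative).

    window=None models global attention, where every past position is kept.
    An integer window models local sliding-window attention, where at most
    `window` recent positions are kept.
    """
    if window is None:
        return seq_len
    return min(seq_len, window)


if __name__ == "__main__":
    # 2048 is an example window size chosen for illustration.
    for n in (1_000, 10_000, 100_000):
        print(n, cached_positions(n), cached_positions(n, window=2048))
```

With global attention the count grows with the sequence, whereas with a local window it stops growing once the window is full, which is consistent with longer generations fitting in the same memory.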