Illustration for: Google Releases Gemma 4 Open Source AI Models for On-Device Use

Google Releases Gemma 4 Open Source AI Models for On-Device Use

Google on Thursday released Gemma 4, a family of four open source multimodal AI models supporting text, image, video and audio inputs and designed to run on consumer hardware.

The release, announced on the Hugging Face blog (https://huggingface.co/blog/gemma4), includes four model variants ranging from 2.3 billion to 31 billion parameters, all licensed under Apache 2.0. The models bring multimodal capabilities previously limited to cloud-hosted systems to laptops, phones and edge devices.

What’s in the Release

The Gemma 4 family comprises four models:

  • Gemma 4 E2B — 2.3 billion effective parameters with a 128,000-token context window, designed for the most resource-constrained environments
  • Gemma 4 E4B — 4.5 billion effective parameters with a 128,000-token context window, supporting image, video and audio processing
  • Gemma 4 26B A4B — A mixture-of-experts architecture activating 4 billion of its 26 billion total parameters per inference, with a 256,000-token context window
  • Gemma 4 31B — A 31 billion dense model with a 256,000-token context window, the largest and most capable in the family

Each model ships in both base and instruction-tuned variants. All four support multimodal inputs including images with variable aspect ratios, while the smaller E2B and E4B models add native audio processing — a capability the larger models do not include.

Benchmark Performance

The flagship 31B model scored 85.2% on MMLU Pro, 89.2% on AIME 2026 and 84.3% on GPQA Diamond, according to benchmarks published in the Hugging Face announcement. Its Codeforces ELO rating reached 2,150, and it achieved an LMArena score of approximately 1,452.

The mixture-of-experts 26B model, which activates only 4 billion parameters per query, posted an LMArena score of roughly 1,441 — approaching the dense 31B model’s performance at a fraction of the compute cost, according to the published results.

The smallest E2B model scored 60% on MMLU Pro and 37.5% on AIME 2026, according to the announcement, in a form factor small enough for mobile deployment.

Technical Architecture

Google introduced several architectural innovations in Gemma 4, including Per-Layer Embeddings, a parallel conditioning pathway that provides each decoder layer with a dedicated modulation vector. The models also use a shared key-value cache system in which later layers reuse cached tensors from earlier ones, reducing memory consumption during long-context generation, according to the technical documentation.

The attention mechanism alternates between local sliding-window layers spanning 512 to 1,024 tokens and global full-context layers, enabling efficient processing of the models’ extended context windows.

On-Device Ecosystem

The release launched with day-one support across five major inference frameworks: Hugging Face Transformers, llama.cpp, Transformers.js for browser-based inference via WebGPU, MLX for Apple Silicon and Mistral.rs for Rust-native deployment, according to the announcement. Quantized checkpoints in GGUF, ONNX and UQFF formats are available for reduced-precision deployment.

Competitive Landscape

The release adds to competition in the U.S. on-device AI market. Apple has been integrating AI capabilities into its devices through Apple Intelligence. Meta has released its Llama model family under permissive licenses. Qualcomm has been optimizing its mobile chips for on-device AI inference.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *