
How Gemma Runs on Your iPhone (Without Eating the Battery)

A short technical explainer of how open LLMs like Gemma fit on mobile hardware in 2026.

October 22, 2025·1 min read

Running a 2-billion-parameter model on a phone sounded ridiculous in 2022. It is now routine. Here is what changed.

Quantization is the unlock

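Gemma 2B's raw weights are about 5GB at 16-bit precision. Quantized to 4-bit (int4), they shrink to roughly 1.5GB. The quality loss is small (5-10% on benchmarks), and the speed gain is enormous, since far fewer bytes have to be read from memory for each generated token.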
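The size arithmetic and the rounding step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular runtime stores Gemma: the 2.5-billion parameter count is a round figure, and the group size of 32 with one 16-bit scale per group is a hypothetical layout chosen to show why a 4-bit file ends up larger than the bare 4-bit total.

```python
def weight_bytes(n_params, bits_per_weight, group_size=None, scale_bits=16):
    """Approximate storage for model weights, in bytes.

    If group_size is given, each group of that many weights also stores one
    scale of scale_bits bits (an assumed group-wise quantization layout).
    """
    extra_bits = scale_bits / group_size if group_size else 0
    return n_params * (bits_per_weight + extra_bits) / 8


def quantize_int4(values):
    """Symmetric int4 quantization of one group of floats.

    Returns (codes, scale), where each code is an integer in [-8, 7].
    """
    peak = max(abs(v) for v in values)
    scale = peak / 7 if peak > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map int4 codes back to approximate float values."""
    return [c * scale for c in codes]


N_PARAMS = 2_500_000_000  # assumed round figure for a "2B"-class model

print(weight_bytes(N_PARAMS, 16) / 1e9)                 # 16-bit size in GB
print(weight_bytes(N_PARAMS, 4) / 1e9)                  # bare 4-bit size in GB
print(weight_bytes(N_PARAMS, 4, group_size=32) / 1e9)   # 4-bit plus scales, in GB

codes, scale = quantize_int4([3.5, -1.5, 0.0, 0.5])
print(codes, dequantize(codes, scale))
```

Under these assumptions the 16-bit figure comes out at 5GB and the 4-bit figure with scales at about 1.4GB, in line with the numbers above. Each dequantized value differs from the original by at most half a scale step, which is where the small quality loss comes from.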

The Neural Engine

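Apple's Neural Engine (ANE) is a dedicated matrix-math accelerator that has shipped in every new iPhone since 2017. In 2026 iPhones it is rated at roughly 35 TOPS (trillion operations per second). Running Gemma through Core ML, which can schedule work on the ANE, or through MLX, which targets the GPU, delivers real-time inference on a pocket device.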
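To put the TOPS figure in perspective, here is a rough compute-ceiling estimate. It assumes the common rule of thumb of about two operations (one multiply, one add) per parameter per generated token, and the same round 2.5-billion parameter count as an assumption. The result is a theoretical upper bound only: real decoding speed is usually far lower, because it tends to be limited by memory bandwidth rather than arithmetic.

```python
def compute_ceiling_tokens_per_sec(ops_per_sec, n_params, ops_per_param=2):
    """Upper bound on tokens/second if arithmetic were the only limit.

    ops_per_param=2 is a rule-of-thumb assumption (multiply + add per weight).
    """
    return ops_per_sec / (ops_per_param * n_params)


ANE_OPS_PER_SEC = 35e12    # the 35 TOPS rating quoted in the text
N_PARAMS = 2_500_000_000   # assumed round figure for a "2B"-class model

print(compute_ceiling_tokens_per_sec(ANE_OPS_PER_SEC, N_PARAMS))
```

The ceiling is in the thousands of tokens per second, far above reading speed, which is why raw compute is rarely the limiting factor for a model of this size.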

Memory is the bottleneck

The phone needs enough RAM to hold the model weights alongside the OS and the app itself. The iPhone 15 Pro and 16 Pro have 8GB; the iPhone 17 Pro has 12GB. That is enough for quantized models in the 2-9B range. Smaller (non-Pro) iPhones with 6GB are limited to ~2B models.
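A back-of-the-envelope fit check makes the constraint concrete. The sketch below is illustrative only: `usable_fraction` (the share of RAM an app can realistically claim) and `overhead_gb` (runtime buffers and cache) are hypothetical knobs, not figures published by Apple, and the parameter counts are round numbers.

```python
def fits_in_ram(ram_gb, n_params, bits_per_weight=4,
                usable_fraction=0.5, overhead_gb=0.5):
    """Return True if the quantized weights plus overhead fit the app's budget.

    usable_fraction and overhead_gb are illustrative assumptions.
    """
    weights_gb = n_params * bits_per_weight / 8 / 1e9
    budget_gb = ram_gb * usable_fraction
    return weights_gb + overhead_gb <= budget_gb


for ram_gb in (6, 12):
    for n_params in (2.5e9, 9e9):
        verdict = "fits" if fits_in_ram(ram_gb, n_params) else "does not fit"
        print(f"{ram_gb}GB phone, {n_params / 1e9:.1f}B params at 4-bit: {verdict}")
```

With these assumed values, a 2B-class model fits on a 6GB phone while a 9B model needs the 12GB device. Changing either knob shifts the boundary, so treat the output as a sanity check rather than a guarantee.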

Battery trade-offs

A minute of continuous LLM inference on an iPhone 15 Pro uses roughly 1.5% of battery. With sporadic usage (a few prompts per hour), each lasting a few seconds, the drain is barely detectable; an hour of streaming video eats more.
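The gap between continuous and sporadic use follows directly from the per-minute figure. The sketch below takes the roughly 1.5% per minute from the text and scales it by active inference time; the prompt counts and per-prompt durations are hypothetical inputs chosen for illustration.

```python
def hourly_drain_percent(prompts_per_hour, seconds_per_prompt,
                         percent_per_minute=1.5):
    """Estimated battery drain per hour from LLM inference alone.

    percent_per_minute defaults to the figure quoted in the text.
    """
    active_minutes = prompts_per_hour * seconds_per_prompt / 60
    return active_minutes * percent_per_minute


print(hourly_drain_percent(5, 12))    # five prompts of 12 seconds each
print(hourly_drain_percent(60, 60))   # inference running the whole hour
```

Five 12-second prompts add up to one minute of inference, or about 1.5% per hour, while running inference non-stop would drain most of the battery in the same time. Keeping generations short and infrequent is what makes the cost negligible.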

What this means for apps

On-device LLMs are no longer a party trick. They are the default architecture for any privacy-sensitive application in 2026. The cloud is a legacy choice.


About Sovereign: A privacy-first AI personal assistant that runs entirely on your iPhone. On-device LLM, zero-knowledge encryption, and a coach that learns from your own words.
