Running a 2-billion-parameter model on a phone sounded ridiculous in 2022. It is now routine. Here is what changed.

Quantization is the unlock

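Gemma 2B's raw weights take about 5GB at 16-bit precision. Quantized to 4-bit (int4), they shrink to roughly 1.5GB. The quality loss is small (5-10% on benchmarks) and the speed gain is large, because each generated token requires reading far fewer bytes from memory.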
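The arithmetic behind those sizes is simple enough to sketch. The snippet below is an illustration, not the scheme any particular Gemma build uses: the 2.5 billion parameter count is an assumption picked to match the 5GB figure, and `quantize_int4` is a hypothetical helper showing plain symmetric 4-bit quantization of one small group of weights.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Storage needed for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


def quantize_int4(values: list[float]) -> tuple[list[int], float]:
    """Symmetric 4-bit quantization of one group of weights.

    Each value maps to an integer code in [-7, 7]; one float scale is kept per group.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int4(codes: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the 4-bit codes."""
    return [c * scale for c in codes]


N_PARAMS = 2.5e9  # assumed parameter count, chosen to match the 5GB figure at 16 bits

print(weight_size_gb(N_PARAMS, 16))  # 16-bit weights
print(weight_size_gb(N_PARAMS, 4))   # 4-bit weights, before scales and metadata

group = [0.12, -0.50, 0.33, 0.07]    # made-up weights for illustration
codes, scale = quantize_int4(group)
print(codes, dequantize_int4(codes, scale))
```

Under these assumptions the 4-bit weights alone come to 1.25GB. The gap to the roughly 1.5GB quoted above is plausibly the per-group scales plus any layers kept at higher precision.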

The Neural Engine

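Apple's Neural Engine (ANE) is a dedicated matrix-math accelerator that has shipped in every iPhone since 2017. In 2026 iPhones it is rated at 35 TOPS (trillion operations per second). Running Gemma on the ANE through Core ML delivers real-time inference on a pocket device. MLX is the other common route, though it targets the GPU rather than the ANE.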
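To get a feel for what 35 TOPS means for a model this size, here is a rough back-of-envelope sketch. It assumes about 2 operations (one multiply, one add) per parameter per generated token and the same assumed 2.5 billion parameters; `compute_bound_tokens_per_sec` is a hypothetical helper, not part of any Apple API.

```python
def compute_bound_tokens_per_sec(tops: float, n_params: float,
                                 ops_per_param: float = 2.0) -> float:
    """Theoretical ceiling on tokens per second if compute were the only limit."""
    ops_per_token = ops_per_param * n_params
    return tops * 1e12 / ops_per_token


# 35 TOPS from the text; 2.5e9 parameters is an assumption.
print(compute_bound_tokens_per_sec(35, 2.5e9))
```

Treat the result as a ceiling only. Peak TOPS are rarely sustained, and token generation is usually limited by how fast weights can be read from memory, so real throughput sits far below this number.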

Memory is the bottleneck

The phone needs enough RAM to hold the model weights, plus working memory for the context. The iPhone 15 Pro and 16 Pro have 8GB; the iPhone 17 Pro has 12GB. That is enough for 4-bit models in the 2-9B range. Non-Pro iPhones with 6GB, such as the iPhone 15, are limited to ~2B models.
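A quick way to sanity-check those tiers is to compare the weight footprint against a memory budget. This is a minimal sketch under loud assumptions: the 60% `usable_fraction` is a guess at how much RAM an app can realistically claim, not an Apple figure, and the check counts weights only (no KV cache or runtime overhead).

```python
def fits_in_memory(n_params: float, bits_per_weight: float,
                   device_ram_gb: float, usable_fraction: float = 0.6) -> bool:
    """True if the weights alone fit in the assumed share of device RAM."""
    weights_gb = n_params * bits_per_weight / 8 / 1e9
    budget_gb = device_ram_gb * usable_fraction  # assumed budget, not an OS guarantee
    return weights_gb <= budget_gb


for ram_gb in (6, 8, 12):
    for n_params in (2.5e9, 9e9):  # parameter counts are illustrative
        verdict = "fits" if fits_in_memory(n_params, 4, ram_gb) else "too large"
        print(f"{ram_gb}GB RAM, {n_params / 1e9:.1f}B params at 4-bit: {verdict}")
```

With that assumed budget, the sketch lines up with the tiers above: a 4-bit 9B model squeezes into 8GB and 12GB devices but not a 6GB one. A tighter budget or a long context would change the answer, so measure on a real device before committing.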

Battery trade-offs

A minute of continuous LLM inference on an iPhone 15 Pro uses roughly 1.5% of battery. Sporadic usage (a few prompts per hour) is barely detectable; streaming video over the same hour eats more.
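The per-minute figure turns into an hourly estimate with one line of arithmetic. The sketch below assumes drain scales linearly with active inference time, and the prompt count and 10-second duration are made-up inputs, not measurements.

```python
DRAIN_PER_MINUTE_PCT = 1.5  # figure quoted above for continuous inference on an iPhone 15 Pro


def hourly_drain_pct(prompts_per_hour: int, seconds_per_prompt: float) -> float:
    """Estimated battery percentage used per hour, assuming linear scaling."""
    active_minutes = prompts_per_hour * seconds_per_prompt / 60
    return active_minutes * DRAIN_PER_MINUTE_PCT


print(hourly_drain_pct(4, 10))   # a few short prompts per hour (assumed workload)
print(hourly_drain_pct(60, 60))  # inference running non-stop for the whole hour
```

Four 10-second prompts work out to about 1% per hour, which is why occasional use barely registers. Running flat out for the full hour would cost about 90% under the same linear assumption.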

What this means for apps

On-device LLMs are no longer a party trick. They are the default architecture for any privacy-sensitive application in 2026. The cloud is a legacy choice.


How Gemma Runs on Your iPhone (Without Eating the Battery)
