Running a 2-billion-parameter model on a phone sounded ridiculous in 2022. It is now routine. Here is what changed.
Quantization is the unlock
Gemma 2B's raw weights are about 5GB at 16-bit precision. Quantized to 4-bit (int4), they shrink to roughly 1.5GB. The quality loss is small (5-10% on benchmarks), and the speed gain is enormous because each generated token has to read far fewer bytes from memory.
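The size figures above come down to simple arithmetic: bytes = parameters × bits per weight / 8. Here is a minimal sketch, assuming about 2.5 billion parameters (a count chosen to be consistent with 5GB at 16 bits per weight, not a figure from this post). The helper names are illustrative. Real int4 formats also store per-group scales, which is one reason a practical file lands nearer 1.5GB than the raw 1.25GB. The second half shows a toy symmetric 4-bit round trip with a single shared scale, which is much cruder than what production quantizers do.

```python
def weight_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumed parameter count, chosen to match ~5GB at 16 bits per weight.
GEMMA_2B_PARAMS = 2.5e9

fp16_gb = weight_size_gb(GEMMA_2B_PARAMS, 16)
int4_gb = weight_size_gb(GEMMA_2B_PARAMS, 4)
print(f"16-bit weights: {fp16_gb:.2f} GB")
print(f"4-bit weights:  {int4_gb:.2f} GB (before scale metadata)")


def quantize_int4(values):
    """Toy symmetric quantization to integers in [-8, 7] with one shared scale."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / 7
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the 4-bit integers back to approximate float values."""
    return [x * scale for x in q]


sample = [0.7, -0.35, 0.1, 0.0]
q, scale = quantize_int4(sample)
restored = dequantize(q, scale)
worst = max(abs(a - b) for a, b in zip(sample, restored))
print("quantized:", q)
print(f"worst round-trip error: {worst:.4f} (at most scale/2 = {scale / 2:.4f})")
```

The rounding error per weight is bounded by half the scale, which is the intuition behind the small benchmark loss.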
The Neural Engine
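Apple's Neural Engine (ANE) is a dedicated matrix-math accelerator built into every iPhone released since 2017. In 2026 iPhones it is rated at 35 TOPS (trillion operations per second). Running Gemma on this hardware via Core ML, which can dispatch work to the ANE, or via MLX, which targets the GPU, delivers real-time inference on a pocket device.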
Memory is the bottleneck
The phone needs enough RAM to hold the model weights, plus working memory for the context, alongside the OS and other apps. iPhone 15 Pro / 16 Pro: 8GB. iPhone 17 Pro: 12GB. That range is enough for 2-9B models at 4-bit. Non-Pro iPhones with 6GB are limited to ~2B models.
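To get a feel for these limits, it helps to see how much of each device's RAM the weights alone would occupy. The sketch below assumes 4-bit weights and counts only the weights: context memory, runtime overhead, and whatever per-app limit iOS enforces are deliberately left out, so the real headroom is smaller than these numbers suggest. The function names are illustrative.

```python
def weights_gb(params_billion: float, bits_per_weight: int = 4) -> float:
    """Weight storage in GB for a model with the given size in billions of parameters."""
    return params_billion * bits_per_weight / 8


def ram_share(params_billion: float, ram_gb: float, bits_per_weight: int = 4) -> float:
    """Fraction of total device RAM taken by the weights alone."""
    return weights_gb(params_billion, bits_per_weight) / ram_gb


# RAM tiers quoted in the text above.
RAM_TIERS_GB = [6, 8, 12]
MODEL_SIZES_B = [2, 9]

for ram in RAM_TIERS_GB:
    for size in MODEL_SIZES_B:
        share = ram_share(size, ram)
        print(f"{size}B model on {ram}GB phone: "
              f"{weights_gb(size):.1f} GB of weights, {share:.0%} of RAM")
```

A 9B model at 4-bit is 4.5GB of weights, which is why it is plausible on a 12GB phone and out of reach on a 6GB one.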
Battery trade-offs
A minute of continuous LLM inference on an iPhone 15 Pro uses roughly 1.5% of battery. Sporadic usage (a few prompts per hour) is barely detectable; streaming video eats more battery than that.
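The 1.5% per minute figure can be turned into a rough budget. This sketch assumes drain scales linearly with inference time, and it uses a hypothetical 10 seconds of generation per prompt. Both are simplifying assumptions for illustration, not measurements.

```python
DRAIN_PCT_PER_MINUTE = 1.5  # figure quoted above for continuous inference


def minutes_to_drain(start_pct: float = 100.0,
                     drain_per_min: float = DRAIN_PCT_PER_MINUTE) -> float:
    """Minutes of continuous inference needed to use up start_pct of battery."""
    return start_pct / drain_per_min


def hourly_drain_pct(prompts_per_hour: int,
                     seconds_per_prompt: float,
                     drain_per_min: float = DRAIN_PCT_PER_MINUTE) -> float:
    """Battery percentage used per hour by sporadic prompts (linear model)."""
    active_minutes = prompts_per_hour * seconds_per_prompt / 60
    return active_minutes * drain_per_min


print(f"Continuous inference from full charge: ~{minutes_to_drain():.0f} minutes")
# Hypothetical usage pattern: 4 prompts per hour, 10 seconds each.
print(f"4 prompts/hour at 10s each: ~{hourly_drain_pct(4, 10):.1f}% per hour")
```

Under these assumptions continuous use would empty a full battery in a little over an hour, while a few short prompts per hour cost about one percent.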
What this means for apps
On-device LLMs are no longer a party trick. They are the default architecture for any privacy-sensitive application in 2026. The cloud is a legacy choice.
About Sovereign: A privacy-first AI personal assistant that runs entirely on your iPhone. On-device LLM, zero-knowledge encryption, and a coach that learns from your own words.