From Zero to 40 Tokens/Sec: Running Gemma 4 on Android via llama.cpp
The technical journey of building an on-device mental health companion with GGUF quantized models on ARM64
Solace addresses Health & Sciences, Digital Equity & Inclusivity, and Safety & Trust - bringing private, offline mental health support to anyone with an Android phone, powered by Google's Gemma 4 E2B model running locally via llama.cpp.
The Problem
One billion people worldwide suffer from mental health conditions. Most never receive care. The barriers are well-documented: cost, stigma, geography, and wait times. But there's a less obvious barrier - privacy. People in crisis won't talk to an app that phones home.
Solace solves this by running a 3-billion parameter language model entirely on-device. No API calls. No cloud inference. No data leaves the phone. The AI companion works on a plane, in a rural village, during a network outage - anywhere.
Architecture Overview
Solace is a 17-module Android application following Clean Architecture with MVVM. The native inference layer uses llama.cpp compiled with multimodal (mtmd) support.
The layers and their main components:
- Presentation layer: a screen with mood picks and crisis helplines, Chat Screen (conversation, streaming), Guided Sessions (5 therapeutic templates), Settings Screen (theme, voice, inference params)
- ViewModel layer: ChatViewModel (1,696 lines), RoleplayViewModel (1,099 lines), ModelDownloadViewModel, SettingsViewModel
- Data layer: ConversationRepository, ModelRepository, SettingsRepository, Room Database, VoskSpeechManager, KittenTtsEngine, TtsTextFilter
- Runtime layer (native C++): GgufEngine (Kotlin JNI wrapper), LLMInference (C++ class), mtmd (multimodal library), GGML (tensor operations)
- llama.cpp submodule: ggml-cpu (ARM64 SIMD), ggml-vulkan (GPU acceleration), common (tokenization, chat templates)
The mood/helpline screen and the Chat Screen talk to ChatViewModel, Guided Sessions to RoleplayViewModel, and the Settings Screen to SettingsViewModel. ChatViewModel depends on ConversationRepository (backed by the Room Database), ModelRepository, and GgufEngine; RoleplayViewModel also uses GgufEngine. VoskSpeechManager and KittenTtsEngine feed ChatViewModel. GgufEngine calls LLMInference over JNI, which in turn uses mtmd, GGML, and common; GGML dispatches to ggml-cpu and ggml-vulkan.
The GGUF-on-ARM Challenge
When we started, we had zero inference on Android. The llama.cpp library is designed for desktop/server environments. Getting it to compile, link, and run on Android's ARM64 architecture required solving several non-trivial problems.
Problem 1: Native Compilation
llama.cpp's CMake build system assumes a host environment. On Android, we cross-compile using the NDK. The key challenge was that LLAMA_BUILD_TOOLS had to be OFF (it pulled in dependencies that don't exist on Android), but we needed the mtmd (multimodal) library. The solution:
set(LLAMA_BUILD_TOOLS OFF CACHE BOOL "" FORCE)
# Manually add only the mtmd subdirectory
set(LLAMA_INSTALL_VERSION "0.0.0" CACHE STRING "" FORCE)
add_subdirectory(${LLAMA_DIR}/tools/mtmd mtmd)
This required setting LLAMA_INSTALL_VERSION because mtmd's CMakeLists.txt references it for set_target_properties.
Problem 2: ARM64 SIMD Optimization
Modern ARM64 chips have wildly different capabilities. A Snapdragon 8 Gen 3 has SVE+I8MM+FP16+Dot Product. A MediaTek Dimensity 9000 has Dot Product but no SVE. A budget Snapdragon 4 Gen 1 has only basic NEON.
We compile seven separate native libraries and select the best one at runtime:
[Flowchart: runtime selection among the native library variants, ending in the universal fallback.]
The runtime detection reads /proc/cpuinfo, extracts CPU part IDs, and maps them to core types (Cortex-A55 = efficiency, Cortex-X3 = prime). This is critical - using the wrong SIMD path can mean the difference between 5 and 40 tokens per second.
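The feature-detection half of this step can be sketched in C++ as follows. This is a minimal illustration, not Solace's actual detection code: the flag names (asimdhp, asimddp, i8mm, sve) are the strings the Linux arm64 kernel prints on the Features line of /proc/cpuinfo, the two library names at the ends of the table come from this article, and the middle entry is a hypothetical variant name added only to show the ordering idea.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// SIMD capabilities relevant to quantized inference on ARM64.
struct CpuFeatures {
    bool fp16 = false;
    bool dotprod = false;
    bool i8mm = false;
    bool sve = false;
};

// Parses the text of /proc/cpuinfo and reads the first "Features" line.
CpuFeatures parseCpuFeatures(const std::string& cpuinfo) {
    CpuFeatures f;
    std::istringstream lines(cpuinfo);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.rfind("Features", 0) != 0) continue;  // line does not start with "Features"
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::istringstream tokens(line.substr(colon + 1));
        std::set<std::string> flags;
        for (std::string t; tokens >> t;) flags.insert(t);
        f.fp16 = flags.count("asimdhp") > 0;
        f.dotprod = flags.count("asimddp") > 0;
        f.i8mm = flags.count("i8mm") > 0;
        f.sve = flags.count("sve") > 0;
        break;
    }
    return f;
}

// One native library variant and the features it requires.
struct Variant {
    const char* name;
    bool fp16, dotprod, i8mm, sve;
};

// Ordered from most to least demanding; the last entry requires nothing.
static const Variant kVariants[] = {
    {"llama_android_v8_4_fp16_dotprod_i8mm_sve", true, true, true, true},
    {"llama_android_fp16_dotprod", true, true, false, false},  // hypothetical name
    {"llama_android", false, false, false, false},             // universal fallback
};

// Returns the first variant whose requirements the CPU satisfies.
std::string selectLibrary(const CpuFeatures& f) {
    for (const Variant& v : kVariants) {
        const bool supported = (!v.fp16 || f.fp16) && (!v.dotprod || f.dotprod) &&
                               (!v.i8mm || f.i8mm) && (!v.sve || f.sve);
        if (supported) return v.name;
    }
    assert(false && "the universal fallback has no requirements and always matches");
    return "llama_android";
}
```

Keeping the variants in a most-demanding-first table means a new build only needs one extra row, and a device with an unrecognized feature mix still lands on the universal library.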
Problem 3: Vulkan GPU Offload
llama.cpp supports Vulkan for GPU inference, but Android's Vulkan implementation varies wildly. We auto-detect vulkan.hpp headers at build time:
find_path(VULKAN_HPP_INCLUDE_DIR NAMES vulkan/vulkan.hpp ...)
if(VULKAN_HPP_INCLUDE_DIR)
set(GGML_VULKAN ON CACHE BOOL "" FORCE)
endif()
At runtime, we let users configure GPU offload layers in Settings. The optimal layer count depends on model size and available VRAM.
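As a rough illustration of how a default layer count could be derived from a memory budget, here is a sketch under strong simplifying assumptions: every layer is treated as the same size, and embeddings, the KV cache, and scratch buffers are ignored. The function name and the budget parameter are hypothetical; Solace itself exposes the layer count as a user setting.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Estimates how many layers fit in a GPU memory budget, assuming all
// layers are equally sized (an approximation, not a measured value).
int estimateGpuLayers(std::uint64_t modelBytes, int totalLayers,
                      std::uint64_t gpuBudgetBytes) {
    assert(totalLayers > 0);
    const std::uint64_t layers = static_cast<std::uint64_t>(totalLayers);
    const std::uint64_t bytesPerLayer = modelBytes / layers;
    if (bytesPerLayer == 0) return totalLayers;  // degenerate input: everything fits
    const std::uint64_t fit = gpuBudgetBytes / bytesPerLayer;
    return static_cast<int>(std::min(fit, layers));
}
```

A value computed this way is only a starting point for the Settings slider; the real limit depends on the driver and on how much memory the rest of the system is using.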
Performance Journey
This is where the story gets interesting. We started at zero tokens per second - the model wouldn't even load. Here's the progression:
Phase 1: It Doesn't Work (Week 1)
First attempts produced immediate crashes. The GGUF model file loaded, but llama_decode() returned -1. Root cause: context size mismatch. The model metadata reported 128K context, but we were initializing with 2048. Fix: read the GGUF header first with our GGUFReader before creating the context.
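To make "read the GGUF header first" concrete, the sketch below validates the fixed-size part of a GGUF file: the 4-byte magic, a 32-bit version, and two 64-bit counts, all little-endian. It is not the GGUFReader mentioned above, and it stops before the metadata key-value section, which is where values such as the context length are stored. The struct and function names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct GgufHeader {
    std::uint32_t version;
    std::uint64_t tensorCount;
    std::uint64_t metadataKvCount;
};

// Reads an n-byte little-endian unsigned integer starting at p.
static std::uint64_t readLe(const unsigned char* p, std::size_t n) {
    assert(n <= 8);
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < n; ++i) {
        v |= static_cast<std::uint64_t>(p[i]) << (8 * i);
    }
    return v;
}

// Parses the fixed GGUF header. Returns false if the buffer is too
// short, the magic is wrong, or the version predates 64-bit counts.
bool parseGgufHeader(const std::vector<unsigned char>& bytes, GgufHeader& out) {
    const std::size_t kHeaderSize = 4 + 4 + 8 + 8;
    if (bytes.size() < kHeaderSize) return false;
    if (std::memcmp(bytes.data(), "GGUF", 4) != 0) return false;
    const std::uint32_t version = static_cast<std::uint32_t>(readLe(bytes.data() + 4, 4));
    if (version < 2) return false;  // version 1 used 32-bit counts
    out.version = version;
    out.tensorCount = readLe(bytes.data() + 8, 8);
    out.metadataKvCount = readLe(bytes.data() + 16, 8);
    return true;
}
```

Rejecting a bad file at this stage gives a clear error before any native context is created, instead of a failure deep inside decoding.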
Phase 2: 2 Tokens/Sec - Qwen 0.8B (Week 1-2)
With the model finally loading, we got a painful 2 tokens/sec on Qwen 0.8B (our test model). The bottleneck was our batch configuration:
// Before: default batch sizes
nBatch = 512, nUbatch = 512
// After: auto-tuned based on thread count
nBatch = getPerformanceBatchSize() // 1024 for 12+ threads
nUbatch = getPerformanceUbatchSize() // 512 for 12+ threads
Also critical: we were running on the universal llama_android library - no SIMD optimizations. Just fixing the batch config brought us to ~8 tokens/sec on Qwen 0.8B.
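A thread-count-based tuner along these lines might look like the sketch below. Only the 12+ thread tier (1024 / 512) comes from the snippet above; the lower tiers and the function name are assumptions chosen for illustration, not values taken from Solace.

```cpp
#include <cassert>

struct BatchConfig {
    int nBatch;   // logical batch size used for prompt processing
    int nUbatch;  // physical micro-batch size
};

// Picks batch sizes from the number of inference threads.
BatchConfig pickBatchConfig(int threads) {
    assert(threads > 0);
    if (threads >= 12) return {1024, 512};  // tier stated in the article
    if (threads >= 6) return {512, 256};    // assumed tier
    return {256, 128};                      // assumed tier
}
```

Each tier keeps nUbatch no larger than nBatch, since the micro-batch is the unit in which a logical batch is actually processed.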
Phase 3: 40 Tokens/Sec - Qwen 0.8B with SIMD (Week 2-3)
Adding ARM64-specific compilation was the breakthrough. The llama_android_v8_4_fp16_dotprod_i8mm_sve library uses:
- FP16 - half-precision floating point, 2x throughput on supported cores
- Dot Product - integer dot product instructions, critical for quantized inference
- I8MM - int8 matrix multiply, 4x throughput for INT8 quantized models
- SVE - Scalable Vector Extension, vector-length-agnostic SIMD, so the same code runs at whatever vector width the hardware implements
Result: 40 tokens/sec on Qwen 0.8B Q4_K_M. A 20x improvement from where we started.
Phase 4: 10-12 Tokens/Sec - Gemma 4 E2B (Week 3-4)
Gemma 4 E2B is a 3-billion parameter model (vs Qwen 0.8B's 800M), so it is roughly 3.75x larger. If throughput scaled inversely with parameter count, 40 tokens/sec would drop to about 40 / 3.75 ≈ 10.7 tokens/sec, and that matches what we measured: 10-12 tokens/sec on a Snapdragon 8 Gen 3 device with the optimized library.
Performance breakdown on a typical generation:
With Vulkan GPU offload (36 layers), token generation improves to ~16-18 tokens/sec on devices with an Adreno 750 GPU.
Multimodal Vision Pipeline
Gemma 4 supports native image understanding through a vision projector (mmproj). The pipeline from image to language model:
The critical detail: the mtmd_default_marker() must be inserted at the correct position in the formatted prompt - at the start of the last user turn. For Gemma 4's chat template, this means finding <|turn>user\n and inserting the marker after it.
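The insertion step itself is plain string handling. In the sketch below the marker is passed in as a parameter; in real code it would be the string returned by mtmd_default_marker(). The function name insertMediaMarker is hypothetical, and the turn prefix is the one quoted above.

```cpp
#include <cassert>
#include <string>

// Inserts `marker` immediately after the last occurrence of
// `userTurnPrefix` in `prompt`. Returns false and leaves `prompt`
// unchanged if the prefix does not occur.
bool insertMediaMarker(std::string& prompt, const std::string& userTurnPrefix,
                       const std::string& marker) {
    assert(!userTurnPrefix.empty());
    const std::string::size_type pos = prompt.rfind(userTurnPrefix);
    if (pos == std::string::npos) return false;
    prompt.insert(pos + userTurnPrefix.size(), marker);
    return true;
}
```

Searching from the end with rfind matters in multi-turn chats: earlier user turns stay untouched, and the image is attached only to the message the user just sent.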
Offline Voice Pipeline
For users in crisis who can't type, voice interaction is essential. We needed a fully offline pipeline:
- Voice input: speech is transcribed by Vosk (offline, ~40MB model) into text
- AI processing: the transcribed text goes to ChatViewModel, which calls GgufEngine (Gemma 4 E2B) to produce the AI response
- Voice output: TtsTextFilter strips thinking blocks from the response, KittenTtsEngine (ONNX, ~23MB model) synthesizes it, and the audio plays through the speaker
Key challenges solved:
- Vosk model download - alphacephei.com hosted the model, but URLs were unreliable. Added retry logic with exponential backoff and a GitHub mirror fallback.
- TTS thinking block filtering - Gemma 4 outputs internal reasoning in <|channel>thought blocks. These must be stripped before TTS, or the user hears the model's internal monologue.
- Module boundaries - VoskSpeechManager and KittenTtsEngine lived in the app/ module but ChatViewModel was in feature-chat/. Solved by moving shared classes to core-data/.
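The thinking-block filter can be sketched as a small string function. This is not the TtsTextFilter implementation: the opening tag would be the <|channel>thought marker described above, while the closing delimiter is not given in this article, so both are passed as parameters and any closing tag a caller supplies is an assumption.

```cpp
#include <cassert>
#include <string>

// Removes every span that starts with `openTag` and ends with
// `closeTag`. An unterminated block is dropped through the end of the
// text, which suits streaming output where the close may not have
// arrived yet.
std::string stripThinkingBlocks(const std::string& text,
                                const std::string& openTag,
                                const std::string& closeTag) {
    assert(!openTag.empty() && !closeTag.empty());
    std::string out;
    std::string::size_type pos = 0;
    while (true) {
        const std::string::size_type start = text.find(openTag, pos);
        if (start == std::string::npos) {
            out.append(text, pos, std::string::npos);  // keep the remaining tail
            break;
        }
        out.append(text, pos, start - pos);  // keep text before the block
        const std::string::size_type end = text.find(closeTag, start + openTag.size());
        if (end == std::string::npos) break;  // block never closes: drop the rest
        pos = end + closeTag.size();
    }
    return out;
}
```

Dropping an unterminated block errs on the side of silence: the user may briefly hear nothing, but never hears partial reasoning read aloud.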
Data Flow: Complete Chat Message
Module Architecture
External model assets and the modules that consume them:
- Main model (3.1GB GGUF) - loaded by the GGUF runtime
- Vision mmproj (941MB) - loaded by the GGUF runtime
- Vosk model (40MB) - used by core-data
- KittenTTS (23MB ONNX) - used by core-data
Key Technical Decisions
| Decision | Why | Trade-off |
|---|---|---|
| Q4_K_M quantization | Best quality/speed ratio for 3B models on mobile | ~3.1GB file size, requires 4GB+ RAM |
| llama.cpp over ONNX Runtime | Native ARM64 SIMD, Vulkan GPU, chat template support | Complex CMake build, JNI bridge required |
| 7 ARM64 library variants | Maximize performance on each SoC | Larger APK (+40MB per variant) |
| Vosk over Android SpeechRecognizer | Fully offline, no Google dependency, streaming | 40MB model download, slightly lower accuracy |
| KittenTTS over system TTS | Consistent voice, no cloud dependency, ONNX | 23MB bundled, lower quality than cloud TTS |
| DuckDuckGo for web search | No API key required, works on Android | HTML scraping is fragile |
| mmproj as separate download | Vision is optional, saves 941MB for text-only users | Extra download step for vision users |
Impact & Accessibility
Solace is designed for the places that need it most:
- Rural areas with no internet - the model runs entirely offline after initial download
- Crisis situations where typing isn't possible - voice input/output handles the full loop
- Privacy-sensitive contexts - no data ever leaves the device, no accounts required
- Low-resource devices - works on Android 12+ phones with 4GB+ RAM
- Multilingual potential - Gemma 4 supports 100+ languages natively
What We Learned
- ARM64 SIMD matters more than anything. The difference between the universal library and the I8MM+SVE variant is 8x on quantized models. This is the single biggest optimization lever.
- Batch size tuning is critical. Default llama.cpp batch sizes are designed for servers. Mobile needs different ratios - larger nBatch for prompt processing, smaller nUbatch for memory-constrained generation.
- GGUF metadata is your friend. Reading the chat template and context size from the GGUF header prevents a whole class of runtime errors.
- The mmproj pipeline works. Gemma 4's multimodal support via mtmd is production-ready on Android - we didn't need to modify any C++ code for vision, just wire the JNI bridge.
- Mental health AI needs guardrails. Crisis keyword detection, helpline numbers, and the system prompt design were as important as the technical implementation.
Build It Yourself
git clone --recurse-submodules https://github.com/HenshinLabs/solace-gemma4good.git
cd solace-gemma4good
echo "sdk.dir=/path/to/Android/Sdk" > local.properties
./gradlew assembleDebug
# Output: app/build/outputs/apk/debug/Solace-v2.0.5-debug.apk
Full documentation: docs/
Links
- Repository: github.com/HenshinLabs/solace-gemma4good
- Release: v2.0.5 (APK downloads)
- Website: henshinlabs.github.io/solace-gemma4good
- Based on: MasterLLM by Shuvam Banerji Seal
This project was developed with the assistance of Gemma 4 31B IT running on an NVIDIA A100 80GB GPU as a coding assistant. The same Gemma 4 family - the E2B variant - powers the on-device inference in the final application.