From Zero to 40 Tokens/Sec: Running Gemma 4 on Android via llama.cpp

The technical journey of building an on-device mental health companion with GGUF quantized models on ARM64

HenshinLabs | May 2026 | Gemma 4 Good Hackathon | llama.cpp Track
Submission for the Gemma 4 Good Hackathon

Solace addresses Health & Sciences, Digital Equity & Inclusivity, and Safety & Trust - bringing private, offline mental health support to anyone with an Android phone, powered by Google's Gemma 4 E2B model running locally via llama.cpp.

The Problem

One billion people worldwide suffer from mental health conditions. Most never receive care. The barriers are well-documented: cost, stigma, geography, and wait times. But there's a less obvious barrier - privacy. People in crisis won't talk to an app that phones home.

Solace solves this by running a 3-billion parameter language model entirely on-device. No API calls. No cloud inference. No data leaves the phone. The AI companion works on a plane, in a rural village, during a network outage - anywhere.

Architecture Overview

Solace is a 17-module Android application following Clean Architecture with MVVM. The native inference layer uses llama.cpp compiled with multimodal (mtmd) support.

graph TB
  subgraph Presentation["Presentation Layer"]
    UI1[Home Screen<br/>Mood picks, crisis helplines]
    UI2[Chat Screen<br/>Conversation, streaming]
    UI3[Guided Sessions<br/>5 therapeutic templates]
    UI4[Settings Screen<br/>Theme, voice, inference params]
  end
  subgraph ViewModel["ViewModel Layer"]
    VM1[ChatViewModel<br/>1,696 lines]
    VM2[RoleplayViewModel<br/>1,099 lines]
    VM3[ModelDownloadViewModel]
    VM4[SettingsViewModel]
  end
  subgraph Data["Data Layer"]
    D1[ConversationRepository]
    D2[ModelRepository]
    D3[SettingsRepository]
    D4[Room Database]
    D5[VoskSpeechManager]
    D6[KittenTtsEngine]
    D7[TtsTextFilter]
  end
  subgraph Runtime["Runtime Layer (Native C++)"]
    R1[GgufEngine<br/>Kotlin JNI wrapper]
    R2[LLMInference<br/>C++ class]
    R3[mtmd<br/>Multimodal library]
    R4[GGML<br/>Tensor operations]
  end
  subgraph Native["llama.cpp Submodule"]
    N1[ggml-cpu<br/>ARM64 SIMD]
    N2[ggml-vulkan<br/>GPU acceleration]
    N3[common<br/>Tokenization, chat templates]
  end
  UI1 --> VM1
  UI2 --> VM1
  UI3 --> VM2
  UI4 --> VM4
  VM1 --> D1
  VM1 --> D2
  VM1 --> R1
  VM2 --> R1
  D1 --> D4
  D5 --> VM1
  D6 --> VM1
  R1 -->|JNI| R2
  R2 --> R3
  R2 --> R4
  R4 --> N1
  R4 --> N2
  R2 --> N3
  style Presentation fill:#1A1714,stroke:#8AAEC4,color:#E8DDD0
  style ViewModel fill:#1A1714,stroke:#D4C47A,color:#E8DDD0
  style Data fill:#1A1714,stroke:#7AAE8E,color:#E8DDD0
  style Runtime fill:#1A1714,stroke:#D4847A,color:#E8DDD0
  style Native fill:#1A1714,stroke:#8A7A66,color:#E8DDD0

The GGUF-on-ARM Challenge

When we started, we had zero inference on Android. The llama.cpp library is designed for desktop/server environments. Getting it to compile, link, and run on Android's ARM64 architecture required solving several non-trivial problems.

Problem 1: Native Compilation

llama.cpp's CMake build system assumes a host environment. On Android, we cross-compile using the NDK. The key challenge was that LLAMA_BUILD_TOOLS had to be OFF (it pulled in dependencies that don't exist on Android), but we needed the mtmd (multimodal) library. The solution:

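# Skip llama.cpp's bundled tools; they pull in dependencies that don't exist on Android
set(LLAMA_BUILD_TOOLS OFF CACHE BOOL "" FORCE)
# mtmd's CMakeLists.txt references LLAMA_INSTALL_VERSION in set_target_properties
set(LLAMA_INSTALL_VERSION "0.0.0" CACHE STRING "" FORCE)
# Manually add only the mtmd subdirectory
add_subdirectory(${LLAMA_DIR}/tools/mtmd mtmd)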

This required setting LLAMA_INSTALL_VERSION because mtmd's CMakeLists.txt references it for set_target_properties.

Problem 2: ARM64 SIMD Optimization

Modern ARM64 chips have wildly different capabilities. A Snapdragon 8 Gen 3 has SVE+I8MM+FP16+Dot Product. A MediaTek Dimensity 9000 has Dot Product but no SVE. A budget Snapdragon 4 Gen 1 has only basic NEON.

We compile seven separate native libraries and select the best one at runtime:

graph LR
  A[App Start] --> B{Read /proc/cpuinfo}
  B --> C{ARMv8.4+SVE+I8MM?}
  C -->|Yes| D[llama_android_v8_4_fp16_dotprod_i8mm_sve]
  C -->|No| E{ARMv8.4+I8MM?}
  E -->|Yes| F[llama_android_v8_4_fp16_dotprod_i8mm]
  E -->|No| G{ARMv8.4+DotProd?}
  G -->|Yes| H[llama_android_v8_4_fp16_dotprod]
  G -->|No| I{ARMv8.2+DotProd?}
  I -->|Yes| J[llama_android_v8_2_fp16_dotprod]
  I -->|No| K{ARMv8.2+FP16?}
  K -->|Yes| L[llama_android_v8_2_fp16]
  K -->|No| M[llama_android<br/>Universal fallback]
  style D fill:#2E2920,stroke:#7AAE8E,color:#E8DDD0
  style F fill:#2E2920,stroke:#7AAE8E,color:#E8DDD0
  style H fill:#2E2920,stroke:#8AAEC4,color:#E8DDD0
  style J fill:#2E2920,stroke:#8AAEC4,color:#E8DDD0
  style L fill:#2E2920,stroke:#D4C47A,color:#E8DDD0
  style M fill:#2E2920,stroke:#D4847A,color:#E8DDD0

The runtime detection reads /proc/cpuinfo, extracts CPU part IDs, and maps them to core types (Cortex-A55 = efficiency, Cortex-X3 = prime). This is critical - using the wrong SIMD path can mean the difference between 5 and 40 tokens per second.
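The selection ladder in the diagram can be sketched in C++ roughly as follows. This is a minimal illustration, not Solace's actual detector: `parseFeatures` and `selectLibrary` are hypothetical names, the flag strings (`fphp`, `asimdhp`, `asimddp`, `i8mm`, `sve`) are the ones the Linux kernel prints on the `Features` line of /proc/cpuinfo, and treating `flagm` or `uscat` as a sign of ARMv8.4 is an assumption made only for this sketch.

```cpp
#include <set>
#include <sstream>
#include <string>

// Collect the flags from the first "Features" line of /proc/cpuinfo text.
std::set<std::string> parseFeatures(const std::string& cpuinfo) {
    std::set<std::string> flags;
    std::istringstream lines(cpuinfo);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.rfind("Features", 0) != 0) continue;  // not a Features line
        const std::size_t colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::istringstream tokens(line.substr(colon + 1));
        std::string flag;
        while (tokens >> flag) flags.insert(flag);
        break;  // one Features line is enough for this sketch
    }
    return flags;
}

// Walk the ladder from the weakest variant up; library names come from the diagram.
std::string selectLibrary(const std::set<std::string>& f) {
    auto has = [&f](const char* flag) { return f.count(flag) > 0; };
    const bool fp16 = has("fphp") && has("asimdhp");
    const bool dotprod = has("asimddp");
    const bool v84 = has("flagm") || has("uscat");  // assumption: proxy for ARMv8.4
    if (!fp16) return "llama_android";
    if (!dotprod) return "llama_android_v8_2_fp16";
    if (!v84) return "llama_android_v8_2_fp16_dotprod";
    if (!has("i8mm")) return "llama_android_v8_4_fp16_dotprod";
    if (!has("sve")) return "llama_android_v8_4_fp16_dotprod_i8mm";
    return "llama_android_v8_4_fp16_dotprod_i8mm_sve";
}
```

Matching whole tokens rather than substrings matters here: a plain substring search for `sve` would also match `sve2`.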

Problem 3: Vulkan GPU Offload

llama.cpp supports Vulkan for GPU inference, but Android's Vulkan implementation varies wildly. We auto-detect vulkan.hpp headers at build time:

find_path(VULKAN_HPP_INCLUDE_DIR NAMES vulkan/vulkan.hpp ...)
if(VULKAN_HPP_INCLUDE_DIR)
 set(GGML_VULKAN ON CACHE BOOL "" FORCE)
endif()

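At runtime, we let users configure GPU offload layers in Settings. The optimal layer count depends on model size and on how much memory the GPU can use; on mobile SoCs that memory is shared with the CPU rather than being dedicated VRAM.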

Performance Journey

This is where the story gets interesting. We started at zero tokens per second - the model wouldn't even load. Here's the progression:

Phase 1: It Doesn't Work (Week 1)

First attempts produced immediate crashes. The GGUF model file loaded, but llama_decode() returned -1. Root cause: context size mismatch. The model metadata reported 128K context, but we were initializing with 2048. Fix: read the GGUF header first with our GGUFReader before creating the context.

Phase 2: 2 Tokens/Sec - Qwen 0.8B (Week 1-2)

With the model finally loading, we got a painful 2 tokens/sec on Qwen 0.8B (our test model). The bottleneck was our batch configuration:

// Before: default batch sizes
nBatch = 512, nUbatch = 512

// After: auto-tuned based on thread count
nBatch = getPerformanceBatchSize() // 1024 for 12+ threads
nUbatch = getPerformanceUbatchSize() // 512 for 12+ threads

Also critical: we were running on the universal llama_android library - no SIMD optimizations. Just fixing the batch config brought us to ~8 tokens/sec on Qwen 0.8B.
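The thread-based tuning can be pictured with a small C++ sketch. Only the 12+ thread values (1024 and 512) come from the snippet above; `BatchConfig` and `pickBatchConfig` are hypothetical names, and falling back to the old 512/512 defaults below 12 threads is an assumption made for illustration.

```cpp
struct BatchConfig {
    int nBatch;   // logical batch size used for prompt processing
    int nUbatch;  // physical micro-batch size
};

BatchConfig pickBatchConfig(int threads) {
    if (threads >= 12) {
        return {1024, 512};  // values quoted in the article for 12+ threads
    }
    return {512, 512};       // assumption: keep the previous defaults otherwise
}
```

Keeping `nUbatch` at or below `nBatch` preserves the relationship between the two sizes that llama.cpp expects.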

Phase 3: 40 Tokens/Sec - Qwen 0.8B with SIMD (Week 2-3)

Adding ARM64-specific compilation was the breakthrough. As its name indicates, the llama_android_v8_4_fp16_dotprod_i8mm_sve library is built for ARMv8.4 with FP16, dot product, I8MM, and SVE instructions enabled.

Result: 40 tokens/sec on Qwen 0.8B Q4_K_M. A 20x improvement from where we started.

Key insight: The I8MM and SVE instructions are what make quantized GGUF models fast on ARM. Without them, the matrix kernels fall back to slower, more generic code paths. A Q4_K_M quantized model benefits enormously from these instructions because the dequantization math is integer-heavy.
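To make "integer-heavy" concrete, here is a scalar C++ reference for a block dot product over 8-bit quantized values. It is a deliberately simplified stand-in, not the real Q4_K_M layout: `QBlock` (32 int8 values sharing one float scale) and `dotBlock` are hypothetical. The inner multiply-accumulate loop is the kind of work that dot-product and I8MM instructions perform on many int8 pairs at once.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBlockSize = 32;

// Simplified quantized block: 32 signed 8-bit values with one shared scale.
struct QBlock {
    float scale;
    std::int8_t q[kBlockSize];
};

// Integer multiply-accumulate over the block, then a single float rescale.
float dotBlock(const QBlock& a, const QBlock& b) {
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < kBlockSize; ++i) {
        acc += static_cast<std::int32_t>(a.q[i]) * static_cast<std::int32_t>(b.q[i]);
    }
    return a.scale * b.scale * static_cast<float>(acc);
}
```

Almost all of the arithmetic happens in the int32 accumulator; floating point only enters once per block.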

Phase 4: 10-12 Tokens/Sec - Gemma 4 E2B (Week 3-4)

Gemma 4 E2B is a 3-billion parameter model (vs Qwen 0.8B's 800M). It's ~4x larger, so we expected ~10 tokens/sec, and that's exactly what we got: 10-12 tokens/sec on a Snapdragon 8 Gen 3 device with the optimized library.
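The back-of-envelope estimate behind that expectation can be written out. It assumes that generation throughput scales inversely with parameter count at the same quantization, which is a rough rule of thumb for memory-bound decoding rather than something we measured:

```latex
t_{\text{Gemma}} \approx t_{\text{Qwen}} \times \frac{N_{\text{Qwen}}}{N_{\text{Gemma}}}
= 40 \times \frac{0.8\,\text{B}}{3\,\text{B}} \approx 10.7\ \text{tokens/sec}
```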

Performance breakdown on a typical generation:

| Metric | Value |
| --- | --- |
| Prompt processing | ~85 tokens/sec |
| Token generation | 10-12 tokens/sec |
| First token latency | ~1.2 seconds |
| Context window | 128K tokens |
| Model load time | ~3 seconds |

With Vulkan GPU offload (36 layers), token generation improves to ~16-18 tokens/sec on devices with an Adreno 750 GPU.

Multimodal Vision Pipeline

Gemma 4 supports native image understanding through a vision projector (mmproj). The pipeline from image to language model:

sequenceDiagram
  participant U as User
  participant K as Kotlin (ChatViewModel)
  participant J as JNI Bridge
  participant C as C++ (LLMInference)
  participant M as mtmd Library
  U->>K: Attach image + type message
  K->>K: Bitmap → RGB byte array
  K->>J: startCompletionWithImage(prompt, rgbBytes, w, h)
  J->>C: JNI call with byte array
  C->>M: mtmd_bitmap_init(width, height, rgbBytes)
  C->>M: Insert __media__ marker in prompt
  C->>M: mtmd_tokenize(ctx, chunks, &text, bitmaps, 1)
  M->>M: CLIP vision encoder processes image
  M->>M: Projector maps to language embedding space
  C->>M: mtmd_helper_eval_chunks()
  C->>C: Token generation loop
  C-->>J: Return token pieces
  J-->>K: Stream to UI
  K-->>U: Show response with image context
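The "Bitmap → RGB byte array" step amounts to dropping the alpha channel and repacking the pixels as tightly packed R, G, B bytes, which is the layout mtmd_bitmap_init expects. A minimal C++ sketch of that repacking, assuming the pixels arrive as packed 0xAARRGGBB integers (the format Android's Bitmap.getPixels produces); `argbToRgb` is a hypothetical helper, and in the app this conversion happens on the Kotlin side:

```cpp
#include <cstdint>
#include <vector>

// Convert packed 0xAARRGGBB pixels into a tightly packed R,G,B byte array.
std::vector<std::uint8_t> argbToRgb(const std::vector<std::uint32_t>& argb) {
    std::vector<std::uint8_t> rgb;
    rgb.reserve(argb.size() * 3);
    for (std::uint32_t pixel : argb) {
        rgb.push_back(static_cast<std::uint8_t>((pixel >> 16) & 0xFF));  // red
        rgb.push_back(static_cast<std::uint8_t>((pixel >> 8) & 0xFF));   // green
        rgb.push_back(static_cast<std::uint8_t>(pixel & 0xFF));          // blue
    }
    return rgb;
}
```

The resulting buffer has width × height × 3 bytes, so its size is a cheap sanity check before the JNI call.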

The critical detail: the marker string returned by mtmd_default_marker() must be inserted at the correct position in the formatted prompt - at the start of the last user turn. For Gemma 4's chat template, this means finding the last <|turn>user\n and inserting the marker immediately after it.
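The marker placement can be sketched as a small string operation in C++. `insertMediaMarker` is a hypothetical helper, not the project's code: it takes the turn header and the marker as parameters (in the real pipeline the marker would come from mtmd_default_marker()), and prepending the marker when no user turn is found is an assumption chosen for this sketch.

```cpp
#include <string>

// Insert `marker` right after the last occurrence of `userTurnHeader` in `prompt`.
std::string insertMediaMarker(std::string prompt,
                              const std::string& userTurnHeader,
                              const std::string& marker) {
    const std::size_t pos = prompt.rfind(userTurnHeader);
    if (pos == std::string::npos) {
        return marker + prompt;  // assumption: fall back to the start of the prompt
    }
    prompt.insert(pos + userTurnHeader.size(), marker);
    return prompt;
}
```

Using `rfind` rather than `find` is the important part: earlier user turns in the history must stay untouched so the image is attached to the message the user just sent.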

Offline Voice Pipeline

For users in crisis who can't type, voice interaction is essential. We needed a fully offline pipeline:

graph LR
  subgraph Input["Voice Input"]
    MIC[Microphone] --> VOSK[Vosk ASR<br/>Offline, ~40MB model]
    VOSK --> TEXT[Transcribed text]
  end
  subgraph Process["AI Processing"]
    TEXT --> VM[ChatViewModel]
    VM --> ENGINE[GgufEngine<br/>Gemma 4 E2B]
    ENGINE --> RESPONSE[AI Response]
  end
  subgraph Output["Voice Output"]
    RESPONSE --> FILTER[TtsTextFilter<br/>Strip thinking blocks]
    FILTER --> TTS[KittenTtsEngine<br/>ONNX, ~23MB model]
    TTS --> SPEAKER[Speaker]
  end
  style Input fill:#1A1714,stroke:#7AAE8E,color:#E8DDD0
  style Process fill:#1A1714,stroke:#8AAEC4,color:#E8DDD0
  style Output fill:#1A1714,stroke:#D4C47A,color:#E8DDD0
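The "strip thinking blocks" step in the diagram could look something like the following C++ sketch. It is not the actual TtsTextFilter (which is part of the Kotlin data layer): `stripBlocks` is a hypothetical function, and the delimiters are passed in as parameters because the exact tags the model emits are not shown here.

```cpp
#include <string>

// Remove every span that starts with `open` and ends with `close`, delimiters included.
// An unterminated block is dropped through to the end of the text.
std::string stripBlocks(const std::string& text,
                        const std::string& open,
                        const std::string& close) {
    if (open.empty() || close.empty()) return text;  // nothing sensible to strip
    std::string out;
    std::size_t pos = 0;
    while (true) {
        const std::size_t start = text.find(open, pos);
        if (start == std::string::npos) {
            out.append(text, pos, std::string::npos);  // keep the remaining tail
            break;
        }
        out.append(text, pos, start - pos);            // keep text before the block
        const std::size_t end = text.find(close, start + open.size());
        if (end == std::string::npos) break;           // unterminated: drop the rest
        pos = end + close.size();
    }
    return out;
}
```

Dropping an unterminated block errs on the side of silence, which seems preferable to reading half-finished reasoning aloud while a response is still streaming.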

Key challenges solved:

Data Flow: Complete Chat Message

sequenceDiagram
  participant U as User
  participant UI as ChatScreen
  participant VM as ChatViewModel
  participant DB as Room Database
  participant EG as GgufEngine
  participant CPP as LLMInference (C++)
  participant TTS as KittenTtsEngine
  U->>UI: Type message + tap send
  UI->>VM: ChatAction.SendMessage
  VM->>VM: Resolve selected model
  VM->>DB: Create/get conversation
  VM->>DB: Add user message
  VM->>EG: Check supportsVision()
  VM->>EG: load() if not loaded
  EG->>CPP: JNI loadModel()
  CPP->>CPP: Read GGUF header, create context
  CPP->>CPP: Build sampler chain
  VM->>EG: getResponseAsFlow(text)
  EG->>CPP: JNI startCompletion()
  CPP->>CPP: Apply chat template
  CPP->>CPP: Tokenize prompt
  loop Token Generation
    CPP->>CPP: llama_decode() → sample → piece
    CPP-->>EG: Return piece
    EG-->>VM: Emit to Flow
    VM->>UI: Update streamingText
  end
  CPP->>CPP: [EOG] reached
  VM->>DB: Add assistant message
  VM->>TTS: speak(filteredText)
  TTS->>TTS: ONNX inference → AudioTrack
  style U fill:#2E2920,stroke:#8AAEC4,color:#E8DDD0
  style CPP fill:#2E2920,stroke:#D4847A,color:#E8DDD0
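The token generation loop in the diagram follows a simple pull-and-emit shape, sketched below in C++. This is an illustration of the control flow only: `streamCompletion` is hypothetical, `nextPiece` stands in for the decode → sample → piece step, and `onPiece` stands in for the emission to the Kotlin Flow.

```cpp
#include <functional>
#include <string>

// Pull pieces until the source reports end-of-generation, emitting each one as it arrives.
// `nextPiece` fills its argument and returns false once generation is finished.
std::string streamCompletion(const std::function<bool(std::string&)>& nextPiece,
                             const std::function<void(const std::string&)>& onPiece) {
    std::string full;
    std::string piece;
    while (nextPiece(piece)) {
        onPiece(piece);  // e.g. update the streaming text in the UI
        full += piece;   // accumulate the complete assistant message
    }
    return full;
}
```

Returning the accumulated text alongside the per-piece callback mirrors the diagram: the UI updates incrementally, while the full message is what gets saved and spoken once generation ends.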

Module Architecture

graph TB
  APP[app/] --> CD[core-data/]
  APP --> CUI[core-ui/]
  APP --> RGGUF[runtime-gguf/]
  APP --> FC[feature-chat/]
  APP --> FR[feature-roleplay/]
  APP --> FS[feature-settings/]
  FC --> CD
  FC --> CUI
  FC --> RGGUF
  FR --> CD
  FR --> CUI
  FR --> RGGUF
  FS --> CD
  FS --> CUI
  CD --> CDOM[core-domain/]
  RGGUF --> CDOM
  subgraph Native["Native (CMake)"]
    RGGUF --> LLAMA[llama.cpp/]
    RGGUF --> MTMD[tools/mtmd/]
  end
  subgraph External["External Models"]
    MODEL[Gemma 4 E2B<br/>3.1GB GGUF]
    MMPROJ[Vision mmproj<br/>941MB]
    VOSK_M[Vosk Model<br/>40MB]
    KITTEN[KittenTTS<br/>23MB ONNX]
  end
  MODEL --> RGGUF
  MMPROJ --> RGGUF
  VOSK_M --> CD
  KITTEN --> CD
  style APP fill:#242019,stroke:#8AAEC4,color:#E8DDD0
  style Native fill:#242019,stroke:#D4847A,color:#E8DDD0
  style External fill:#242019,stroke:#7AAE8E,color:#E8DDD0

Key Technical Decisions

| Decision | Why | Trade-off |
| --- | --- | --- |
| Q4_K_M quantization | Best quality/speed ratio for 3B models on mobile | ~3.1GB file size, requires 4GB+ RAM |
| llama.cpp over ONNX Runtime | Native ARM64 SIMD, Vulkan GPU, chat template support | Complex CMake build, JNI bridge required |
| 7 ARM64 library variants | Maximize performance on each SoC | Larger APK (+40MB per variant) |
| Vosk over Android SpeechRecognizer | Fully offline, no Google dependency, streaming | 40MB model download, slightly lower accuracy |
| KittenTTS over system TTS | Consistent voice, no cloud dependency, ONNX | 23MB bundled, lower quality than cloud TTS |
| DuckDuckGo for web search | No API key required, works on Android | HTML scraping is fragile |
| mmproj as separate download | Vision is optional, saves 941MB for text-only users | Extra download step for vision users |

Impact & Accessibility

Solace is designed for the places that need it most:

What We Learned

  1. ARM64 SIMD matters more than anything. The difference between the universal library and the I8MM+SVE variant is 8x on quantized models. This is the single biggest optimization lever.
  2. Batch size tuning is critical. Default llama.cpp batch sizes are designed for servers. Mobile needs different ratios - larger nBatch for prompt processing, smaller nUbatch for memory-constrained generation.
  3. GGUF metadata is your friend. Reading the chat template and context size from the GGUF header prevents a whole class of runtime errors.
  4. The mmproj pipeline works. Gemma 4's multimodal support via mtmd is production-ready on Android - we didn't need to modify any C++ code for vision, just wire the JNI bridge.
  5. Mental health AI needs guardrails. Crisis keyword detection, helpline numbers, and the system prompt design were as important as the technical implementation.
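
As a rough picture of the simplest form crisis keyword detection can take, here is a C++ sketch of a case-insensitive phrase match. It is not Solace's implementation: `containsCrisisPhrase` is a hypothetical function, the phrase list is supplied by the caller, and plain substring matching is only a first-pass filter that would miss paraphrases and produce false positives.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Lowercase a copy of the string (ASCII only, which is enough for this sketch).
std::string toLowerAscii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// True if any non-empty phrase occurs in the message, ignoring ASCII case.
bool containsCrisisPhrase(const std::string& message,
                          const std::vector<std::string>& phrases) {
    const std::string haystack = toLowerAscii(message);
    return std::any_of(phrases.begin(), phrases.end(), [&](const std::string& phrase) {
        return !phrase.empty() && haystack.find(toLowerAscii(phrase)) != std::string::npos;
    });
}
```

A check like this is cheap enough to run before inference starts, so helpline information can be shown even while the model is still loading.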

Build It Yourself

git clone --recurse-submodules https://github.com/HenshinLabs/solace-gemma4good.git
cd solace-gemma4good
echo "sdk.dir=/path/to/Android/Sdk" > local.properties
./gradlew assembleDebug
# Output: app/build/outputs/apk/debug/Solace-v2.0.5-debug.apk

Full documentation: docs/

Built with Gemma 4

This project was developed with the assistance of Gemma 4 31B IT running on an NVIDIA A100 80GB GPU as a coding assistant. The same Gemma 4 family - the E2B variant - powers the on-device inference in the final application.