
Run a Local LLM on Android with llama.cpp + Vulkan
Summary
Compile llama.cpp with Vulkan in Termux and run a quantized LLM on your Android GPU, no root.
This week r/LocalLLaMA lit up over a single screenshot: a quantized 7B model generating text at double-digit tokens per second on a mid-range Android phone, with the GPU doing the heavy lifting through Vulkan and no root access anywhere in sight. The thread blew past everything else on the subreddit because it cracks a problem people assumed needed a Snapdragon flagship or a custom ROM: real on-device inference on hardware you already own.
The trick is not a new model. It is llama.cpp compiled with its Vulkan backend inside Termux, talking to the phone's Mali or Adreno GPU through the standard libvulkan.so driver. By the end of this guide you will have llama.cpp built on your phone, a small GGUF model running fully offline, GPU layers offloaded for speed, and an OpenAI-compatible server you can hit from your laptop on the same Wi-Fi.
Everything here was checked against the official llama.cpp build docs and the community Termux + Vulkan tutorials that the viral post pointed back to. The commands are copy-paste ready.
Why this matters now
On-device inference is having a moment. Models in the 1B to 4B range (Gemma 3, Qwen 3, Llama 3.2, DeepSeek-R1 distills) are now good enough for summarization, classification, JSON extraction, and casual chat, while quantization shrinks them to 1-3 GB. The missing piece on phones was the GPU: CPU-only inference works but drains battery and stalls on prompt processing. Vulkan is the one GPU API that ships on essentially every modern Android device, so a Vulkan backend means one binary that runs on Adreno, Mali, and Xclipse alike. That is exactly why this approach went viral instead of staying a niche flagship hack.
Prerequisites
- An Android phone (Android 10+), ideally with 6 GB RAM or more. 8 GB+ is comfortable.
- A GPU that exposes a Vulkan driver. Most recent Adreno (Qualcomm Snapdragon) and Mali (MediaTek, Google Tensor, and many Exynos chips) GPUs do.
- Termux installed from F-Droid or GitHub, NOT the abandoned Play Store build. The Play Store version is years out of date and its package repos are broken.
- Roughly 3-4 GB of free storage for the toolchain, source, and one model.
- Basic comfort with a Linux shell. No root required.
Why F-Droid specifically: the Play Store build of Termux was frozen long ago, and running pkg against its mirrors fails on modern packages. Uninstall it first if you have it, then install the F-Droid build so the steps below resolve cleanly.
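To check the RAM prerequisite without digging through Android settings, one option is to read /proc/meminfo, which Termux can normally read without root. The sketch below is a minimal example. The function names and the tier thresholds (set slightly under 6 GB and 8 GB, because phones report a bit less than their marketed RAM) are assumptions made for illustration.

```python
import re


def parse_mem_total_gb(meminfo_text: str) -> float:
    """Return total RAM in GiB from the contents of /proc/meminfo."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1)) / (1024 * 1024)  # kB -> GiB


def ram_verdict(total_gb: float) -> str:
    """Map total RAM onto rough tiers (thresholds are assumptions)."""
    if total_gb >= 7.0:  # a phone sold as "8 GB" reports slightly less
        return "comfortable"
    if total_gb >= 5.0:  # a phone sold as "6 GB" reports slightly less
        return "workable"
    return "tight"
```

Inside Termux you would feed it the real file, for example `ram_verdict(parse_mem_total_gb(open("/proc/meminfo").read()))`.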
Step 1 - Give Termux storage access
Open Termux and grant storage so you can move models in and out later:
termux-setup-storage
Accept the Android permission dialog. This creates ~/storage symlinks to your Downloads and shared folders, handy for sideloading a GGUF you already downloaded with a browser.
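If you already have GGUF files in your Downloads folder, a short script can list them once storage access is granted. This is a minimal sketch; it assumes the ~/storage/downloads symlink that termux-setup-storage creates, and the helper names are illustrative.

```python
from pathlib import Path


def find_gguf_files(root: Path) -> list[Path]:
    """Return every *.gguf file under root, largest first."""
    if not root.is_dir():
        return []
    files = [p for p in root.rglob("*.gguf") if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_size, reverse=True)


def describe(path: Path) -> str:
    """Format one file as '<size in GB>  <path>'."""
    size_gb = path.stat().st_size / 1e9
    return f"{size_gb:6.2f} GB  {path}"


if __name__ == "__main__":
    # Assumed location of the shared Downloads folder inside Termux.
    downloads = Path.home() / "storage" / "downloads"
    for model in find_gguf_files(downloads):
        print(describe(model))
```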
Step 2 - Install the Vulkan toolchain and build tools
Update packages, then pull the Vulkan tooling plus the compiler stack. tur-repo is the Termux User Repository, which carries the GPU bits; shaderc compiles the GLSL compute shaders llama.cpp needs.
pkg update && pkg upgrade -y
pkg install -y tur-repo x11-repo
pkg install -y vulkan-tools vulkan-headers vulkan-loader-android shaderc
pkg install -y git cmake clang ninja
vulkan-loader-android is the piece that bridges to your device's real driver at /system/lib64/libvulkan.so. vulkan-headers gives the compiler the Vulkan API definitions. If a package name 404s, run pkg update again so the new repos are indexed.
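Before moving on, it can save time to confirm that the tools from this step actually landed on your PATH. The sketch below uses only the Python standard library (pkg install python if you want to run it in Termux). The executable names are an assumption about what these packages install; in particular, glslc is the compiler binary that normally comes with shaderc.

```python
import shutil

# Executables expected after Step 2 (names are assumptions, adjust as needed).
REQUIRED_TOOLS = ["git", "cmake", "clang", "ninja", "glslc", "vulkaninfo"]


def missing_tools(names: list[str]) -> list[str]:
    """Return the entries of names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all tools found")
```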
Step 3 - Confirm the GPU is actually reachable
Do not skip this. Half the failed builds people post are devices where Vulkan never initialized. Run:
vulkaninfo | head -n 40
You want to see a real GPU id and a deviceName like Adreno (TM) 740 or Mali-G715. Example trimmed output:
==========
VULKANINFO
==========
Vulkan Instance Version: 1.3.274
GPU0:
apiVersion = 1.3.274
driverName = Adreno Vulkan Driver
deviceName = Adreno (TM) 740
deviceType = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
If vulkaninfo errors with Cannot find a compatible Vulkan ICD, your phone is missing a usable driver. On some devices the community tutorials fall back to a Mesa driver from the Termux user repos (Turnip, the one most often mentioned, supports Adreno GPUs only); on most 2022+ phones the stock driver just works.
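If you want to automate this check, the vulkaninfo text can be parsed for the fields that matter. This is a rough sketch that assumes the key = value layout of the trimmed sample above. Real vulkaninfo output varies between versions, so treat the patterns as a starting point; the function names are illustrative.

```python
import re

FIELD = re.compile(r"(deviceName|driverName|apiVersion|deviceType)\s*=\s*(.+)")


def parse_gpus(vulkaninfo_text: str) -> list[dict[str, str]]:
    """Collect selected fields for each 'GPUn:' section of the output."""
    gpus: list[dict[str, str]] = []
    current = None
    for raw_line in vulkaninfo_text.splitlines():
        line = raw_line.strip()
        if re.fullmatch(r"GPU\d+:", line):
            current = {}
            gpus.append(current)
            continue
        match = FIELD.fullmatch(line)
        if match and current is not None:
            # Keep the first value seen for each field in this section.
            current.setdefault(match.group(1), match.group(2).strip())
    return gpus


def has_real_gpu(gpus: list[dict[str, str]]) -> bool:
    """True if any device has a name and is not a CPU (software) device."""
    return any(
        "deviceName" in gpu
        and gpu.get("deviceType") != "PHYSICAL_DEVICE_TYPE_CPU"
        for gpu in gpus
    )
```

You could feed it the output of subprocess.run(["vulkaninfo"], capture_output=True, text=True).stdout and stop early when has_real_gpu returns False.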
Step 4 - Clone and build llama.cpp with the Vulkan backend
Clone the source and the Vulkan headers, then configure CMake with the Vulkan backend turned on. The two -DVulkan_* flags point CMake at the phone's system driver and the headers you just cloned, which is the part that trips people up on Android.
cd ~
git clone https://github.com/KhronosGroup/Vulkan-Headers.git
git clone https://github.com/ggml-org/llama.cpp.git
cd ~/llama.cpp
cmake -B build \
  -DGGML_VULKAN=ON \
  -DVulkan_LIBRARY=/system/lib64/libvulkan.so \
  -DVulkan_INCLUDE_DIR=$HOME/Vulkan-Headers/include \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
The build takes 5-15 minutes depending on your SoC and how many cores nproc reports. When it finishes, the binaries land in ~/llama.cpp/build/bin/. The two you care about are llama-cli (interactive/one-shot) and llama-server (OpenAI-compatible HTTP API).
Quick sanity check that the binary runs and the Vulkan backend is compiled in:
./build/bin/llama-cli --version
# a Vulkan build logs a line like: ggml_vulkan: Found 1 Vulkan devices
# (if it does not show up here, look for it when the first model loads in Step 5)
Step 5 - Pull a model and run it on the GPU
Modern llama.cpp can fetch a GGUF straight from Hugging Face with the -hf flag, so you do not have to download by hand. It caches the download locally, so later runs skip the fetch. Start with something tiny so the first run is fast:
cd ~/llama.cpp/build/bin
# -hf downloads + caches the model, then runs it
./llama-cli \
-hf ggml-org/gemma-3-1b-it-GGUF \
-ngl 99 \
-p "Explain what a tokenizer does in two sentences." \
-n 128 -no-cnv
-ngl 99 offloads as many transformer layers to the GPU as will fit (99 just means "all of them"; the runtime caps it at the real layer count). -n 128 caps generated tokens, -no-cnv runs one-shot instead of opening a chat loop, and -p is your prompt. Example output:
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Adreno (TM) 740 (Adreno Vulkan Driver)
load_tensors: offloaded 27/27 layers to GPU
A tokenizer splits raw text into smaller units called tokens (sub-words,
characters, or symbols) that a language model can process numerically.
It also maps each token to an integer ID and back, so the model can read
input and turn its predictions into readable text.
llama_perf: prompt eval = 41.2 ms / 12 tokens
llama_perf: eval time = 1180.5 ms / 64 runs ( 18.4 ms per token, 54.2 tokens/s)
Two lines tell you it worked: offloaded 27/27 layers to GPU means the whole model is on the Adreno, and the tokens/s figure is your real throughput. On a 1B model a recent Adreno or Mali commonly lands in the 20-55 tok/s range; bigger models are slower.
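The throughput figures in those perf lines are plain arithmetic: tokens divided by elapsed seconds. A minimal sketch, with illustrative function names, that you can use to recompute prefill and generation speed from the millisecond and token counts in your own log:

```python
def tokens_per_second(elapsed_ms: float, n_tokens: int) -> float:
    """Throughput in tokens per second for n_tokens processed in elapsed_ms."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return n_tokens / (elapsed_ms / 1000.0)


def ms_per_token(elapsed_ms: float, n_tokens: int) -> float:
    """Average latency per token in milliseconds."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return elapsed_ms / n_tokens


def seconds_to_generate(n_tokens: int, tok_per_s: float) -> float:
    """Rough wall-clock estimate for generating n_tokens at a given speed."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return n_tokens / tok_per_s
```

Computing both the prompt eval and the eval figure separately is worthwhile, because the two can differ by an order of magnitude on mobile GPUs.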
Step 6 - Bump up to a useful model size
Once the 1B works, step up to a 3-4B model at Q4_K_M quantization, which is the sweet spot for quality versus size on a phone. Append a quant tag after a colon to grab a specific file:
./llama-cli \
-hf bartowski/Qwen_Qwen3-4B-GGUF:Q4_K_M \
-ngl 99 -c 4096 \
-p "Summarize the plot of Romeo and Juliet for a 10-year-old." \
-n 200 -no-cnv
-c 4096 sets the context window in tokens. Bigger context uses more RAM for the KV cache, so if the process gets killed, drop it to -c 2048. If you see fewer than all layers offloaded (e.g. offloaded 20/37), your GPU memory budget is full; either lower -ngl to a number that fits or use a smaller quant like Q4_0.
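To see why halving -c helps, it can be useful to put rough numbers on the two big memory consumers: the quantized weights and the KV cache. The sketch below uses the usual back-of-the-envelope formulas. The bits-per-weight figure for Q4_K_M and the layer and head counts are illustrative assumptions, not values read from any particular model, so take the real ones from the metadata llama.cpp prints at load time. Compute buffers and other runtime overhead are ignored.

```python
def weights_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9


def kv_cache_gb(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    n_ctx: int,
    bytes_per_elem: int = 2,  # 2 bytes assumes an f16 cache
) -> float:
    """Approximate KV cache size in GB: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 1e9


if __name__ == "__main__":
    # Hypothetical 4B-class model; every number here is a placeholder.
    w = weights_gb(4.0, 4.8)  # ~4.8 bits/weight assumed for Q4_K_M
    for ctx in (4096, 2048):
        kv = kv_cache_gb(n_layers=36, n_kv_heads=8, head_dim=128, n_ctx=ctx)
        print(f"ctx={ctx}: weights ~{w:.2f} GB + KV cache ~{kv:.2f} GB")
```

The KV cache term scales linearly with n_ctx, which is why dropping from 4096 to 2048 halves that part of the budget while leaving the weights untouched.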
Step 7 - Serve an OpenAI-compatible API to your laptop
The real payoff: turn the phone into a tiny local inference server and call it from anything that speaks the OpenAI API. Bind to all interfaces so other devices on your Wi-Fi can reach it:
./llama-server \
-hf ggml-org/gemma-3-1b-it-GGUF \
-ngl 99 -c 4096 \
--host 0.0.0.0 --port 8080
Find the phone's LAN IP with ip addr show wlan0 (look for the inet 192.168.x.x line). Then from your laptop, hit the standard chat completions endpoint:
curl http://192.168.1.42:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma",
    "messages": [{"role": "user", "content": "Give me 3 startup name ideas for a plant care app."}],
    "temperature": 0.7
  }'
Response (trimmed) follows OpenAI's chat completion schema, so existing client code works with only the base URL changed:
{
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "1. Sprout & Co\n2. LeafLogic\n3. PocketBotanist"
    },
    "finish_reason": "stop"
  }],
  "model": "gemma",
  "usage": {"prompt_tokens": 21, "completion_tokens": 18, "total_tokens": 39}
}
Because the path is /v1/chat/completions, you can point the official OpenAI Python SDK at it by setting base_url="http://192.168.1.42:8080/v1" and any dummy API key. Your phone is now a private, offline model endpoint.
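If you would rather not install the SDK on the client machine, the same request can be made with the Python standard library alone. This is a minimal sketch that mirrors the curl call above; the function names are illustrative, and the IP address is the same example address used throughout.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str,
    user_prompt: str,
    model: str = "gemma",
    temperature: float = 0.7,
) -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of a chat completion response."""
    return response_body["choices"][0]["message"]["content"]


def ask(base_url: str, user_prompt: str, timeout_s: float = 120.0) -> str:
    """Send one prompt to the server and return the reply text."""
    request = build_chat_request(base_url, user_prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return extract_reply(json.load(response))
```

Calling ask("http://192.168.1.42:8080/v1", "Hello") from the laptop should return the model's reply as a string. The generous timeout is deliberate, since prompt processing on a phone can take a while.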
Worked example: an offline note summarizer
Here is a tiny end-to-end use case. Keep llama-server running from Step 7, then from your laptop run a script that summarizes a text file using the phone's GPU. No data leaves your network.
from openai import OpenAI

# Phone's LAN address from Step 7; the API key is a dummy value.
client = OpenAI(base_url="http://192.168.1.42:8080/v1", api_key="local")

# Read the notes to summarize.
with open("meeting_notes.txt", encoding="utf-8") as f:
    notes = f.read()

resp = client.chat.completions.create(
    model="gemma",
    messages=[
        {"role": "system", "content": "You summarize notes into 3 bullet action items."},
        {"role": "user", "content": notes},
    ],
    temperature=0.3,  # low temperature keeps the summary focused
)
print(resp.choices[0].message.content)
Example input and output. Input file meeting_notes.txt:
Discussed Q3 launch. Marketing needs final copy by Friday. Eng blocked on
the payment API key. Sarah will demo the beta to the board next Tuesday.
Output (exact wording will vary between runs):
- Marketing: deliver final launch copy by Friday
- Eng: obtain the payment API key to unblock development
- Sarah: prepare and run the beta demo for the board on Tuesday
That entire round trip ran on a phone GPU sitting on your desk. This is the pattern people get excited about: a free, private, always-available model for the boring-but-useful 80% of LLM tasks.
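Real notes are often longer than a small context window allows, and long prompts are slow on mobile GPUs. One common workaround is to split the text into chunks, summarize each, and then summarize the summaries. The sketch below covers only the splitting step. It budgets by characters, on the rough assumption of about four characters per token for English text, and the function name is illustrative.

```python
def split_into_chunks(text: str, max_chars: int) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # A single oversized paragraph is hard-split into fixed-size pieces.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this paragraph would overflow, so start a new chunk.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

With a 4096-token context, a budget of around 8000 characters per chunk leaves roughly half the window for the system prompt and the reply; tune it against your own model and notes.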
Common pitfalls and how to fix them
- Play Store Termux. It is deprecated and its repos are broken. Uninstall it and install from F-Droid or GitHub, or nothing in Step 2 will resolve.
- vulkaninfo says no ICD found. Your device lacks a usable Vulkan driver in the Termux environment. Re-run pkg install vulkan-loader-android first. On some older devices the community guides fall back to a Mesa driver from the user repos; note that Turnip, the one most often mentioned, supports Adreno GPUs only. If the GPU truly is not exposed, build CPU-only by dropping -DGGML_VULKAN=ON and using -ngl 0.
- CMake cannot find Vulkan. Double-check the two paths. -DVulkan_LIBRARY must be /system/lib64/libvulkan.so (it lives in the read-only system partition, not in Termux), and -DVulkan_INCLUDE_DIR must point at the headers you cloned.
- Process killed mid-load (OOM). Android's low-memory killer reclaimed RAM. Use a smaller model or quant, lower the context with -c 2048, or reduce GPU layers with a specific -ngl number instead of 99.
- Gibberish output on Vulkan. A known issue on a few driver/model combos. Update llama.cpp (git pull then rebuild), try a different quant, or temporarily run CPU-only to confirm the model file itself is fine.
- Great generation speed but slow prompts. Prompt (prefill) throughput on mobile GPUs is often much lower than generation. Keep prompts short, and reuse the server so the model stays warm between requests.
- Thermal throttling. Long sessions heat the SoC and clocks drop. For sustained loads, keep the phone cool and do not expect laptop-class endurance.
Quick reference
| Flag / command | What it does |
|---|---|
| pkg install vulkan-tools vulkan-headers vulkan-loader-android shaderc | Vulkan runtime + headers + shader compiler in Termux |
| vulkaninfo | Confirm the GPU and Vulkan driver are reachable |
| -DGGML_VULKAN=ON | Compile llama.cpp's Vulkan GPU backend |
| -DVulkan_LIBRARY=/system/lib64/libvulkan.so | Point CMake at the phone's system Vulkan driver |
| -hf user/repo[:QUANT] | Download + cache a GGUF from Hugging Face and run it |
| -ngl 99 | Offload all model layers to the GPU |
| -c 4096 | Context window in tokens (lower it to save RAM) |
| -no-cnv | One-shot generation instead of a chat loop |
| llama-server --host 0.0.0.0 --port 8080 | Expose an OpenAI-compatible API on your LAN |
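
If you launch the server from a script, assembling the argument list in code keeps the flags from the table in one place. This is a small sketch using only flags that appear in this guide; the function name, defaults, and validation rules are illustrative choices.

```python
import shlex


def build_server_args(
    hf_repo: str,
    ngl: int = 99,
    ctx: int = 4096,
    host: str = "0.0.0.0",
    port: int = 8080,
    binary: str = "./llama-server",
) -> list[str]:
    """Return an argv list for llama-server built from the given settings."""
    if not 0 < port < 65536:
        raise ValueError("port must be between 1 and 65535")
    if ngl < 0:
        raise ValueError("ngl must not be negative")
    if ctx <= 0:
        raise ValueError("ctx must be positive")
    return [
        binary,
        "-hf", hf_repo,
        "-ngl", str(ngl),
        "-c", str(ctx),
        "--host", host,
        "--port", str(port),
    ]


if __name__ == "__main__":
    args = build_server_args("ggml-org/gemma-3-1b-it-GGUF")
    # shlex.join gives a copy-pasteable command line for logging.
    print(shlex.join(args))
```

The returned list can be passed straight to subprocess.run(args, check=True) from ~/llama.cpp/build/bin; passing a list rather than a string avoids shell quoting problems.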
Next steps
- Benchmark your device with llama-bench (also built in build/bin/) to compare quants and -ngl settings on your exact GPU.
- Try a reasoning distill like a DeepSeek-R1 1.5B GGUF for math and logic tasks, and watch the <think> traces stream.
- Wrap llama-server in a Termux:Boot script so your phone becomes an always-on local endpoint.
- Add a simple RAG layer: embed local notes with a small embedding GGUF and feed retrieved chunks into the prompt, all on-device.
- Experiment with KV-cache quantization (--cache-type-k q8_0) to fit longer contexts in limited GPU memory.
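For the RAG idea, the retrieval step itself needs very little code. The sketch below ranks note chunks by cosine similarity and builds a prompt from the best matches. It assumes you already have one embedding vector per chunk from whatever embedding model you choose; the vectors, function names, and prompt wording here are all hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: list[float],
    indexed_chunks: list[tuple[str, list[float]]],
    k: int = 2,
) -> list[str]:
    """Return the text of the k chunks most similar to the query vector."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine retrieved chunks and the question into one user prompt."""
    context = "\n---\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The resulting prompt goes into the same chat completions call used earlier, so retrieval and generation both stay on your own hardware.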
The headline from this week's viral thread is simple: the phone in your pocket is now a capable, private LLM box. With llama.cpp's Vulkan backend you get GPU acceleration on commodity Android hardware, no root, no cloud, and a standard API your existing tools already understand.