How to Run Ollama on a Raspberry Pi or ARM Device

Ollama runs on ARM-based hardware — Raspberry Pi 4/5, NVIDIA Jetson, Apple Silicon Macs, and other ARM Linux devices. On a Raspberry Pi 5 with 8GB RAM you can run small models (1–3B parameters) at a usable 3–8 tokens per second, enough for offline assistants, home automation scripts, and edge AI applications that do not need internet access. This guide covers installation, model selection, and realistic expectations for ARM deployments.

Supported ARM Devices

  • Raspberry Pi 5 (8GB) — best RPi option, 3–8 tok/s on 1–3B models
  • Raspberry Pi 4 (8GB) — usable, 2–5 tok/s on 1–3B models, tight on memory
  • Raspberry Pi 4 (4GB) — limited; only sub-2B models run comfortably
  • NVIDIA Jetson Orin — GPU-accelerated ARM, much faster (30–60 tok/s on 7B)
  • Orange Pi, Rock Pi, etc. — similar to RPi 4/5 depending on RAM
  • Apple Silicon Mac (M1/M2/M3/M4) — fast unified memory, already covered in other articles

Installation on Raspberry Pi OS / ARM Linux

# Same installer as x86 Linux — detects ARM automatically
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version

# Pull a small model appropriate for RPi RAM
ollama pull qwen2.5:1.5b   # ~1GB, runs on 4GB RPi
ollama pull llama3.2:1b     # ~1.3GB, very fast
ollama pull gemma3:1b       # ~0.8GB, good quality for size

# Run interactively
ollama run qwen2.5:1.5b

Model Selection for ARM

The key constraint is RAM. On a Raspberry Pi, the model must fit in RAM along with the OS (roughly 1–1.5GB overhead). Practical limits:

  • Raspberry Pi 4/5 4GB: models up to ~2.5GB — 1B models comfortably, 1.5B tight
  • Raspberry Pi 5 8GB: models up to ~6.5GB — 3B models comfortably, 3.5B possible
  • Jetson Orin 8GB: memory shared between CPU and GPU — a 7B Q4_K_M model runs well with hardware acceleration
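
As a rough rule of thumb, a model fits when its download size plus about 1.5GB of OS overhead stays within total RAM. A minimal arithmetic sketch (the sizes and the overhead figure are illustrative, not exact):

# Rough RAM-fit check; actual overhead varies with what else the Pi is running
OS_OVERHEAD_GB = 1.5

def fits_in_ram(model_size_gb: float, total_ram_gb: float) -> bool:
    return model_size_gb + OS_OVERHEAD_GB <= total_ram_gb

print(fits_in_ram(1.0, 4.0))   # ~1GB model (e.g. qwen2.5:1.5b) on a 4GB Pi: True
print(fits_in_ram(2.0, 8.0))   # ~2GB model (e.g. qwen2.5:3b) on an 8GB Pi 5: True
print(fits_in_ram(4.7, 4.0))   # ~4.7GB 7B Q4 model on a 4GB Pi: False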

Recommended models for Raspberry Pi 5 (8GB):

ollama pull qwen2.5:3b        # Best quality in 3B tier
ollama pull llama3.2:3b       # Good general purpose
ollama pull phi4-mini          # Microsoft's compact model, strong reasoning
ollama pull moondream          # Vision model (moondream2), sub-2B
ollama pull nomic-embed-text   # Embeddings for local RAG

Running as a systemd Service on Raspberry Pi

# Ollama installer creates the service automatically
systemctl status ollama

# If not, create it (same as the Linux systemd article)
# Add overrides: keep models loaded longer and raise the open-file limit
sudo systemctl edit ollama

# Add:
# [Service]
# Environment="OLLAMA_KEEP_ALIVE=1h"
# LimitNOFILE=65536

sudo systemctl daemon-reload && sudo systemctl restart ollama

Performance Optimisation for ARM

# Use smaller quantisation to fit more in RAM
ollama pull qwen2.5:3b-instruct-q4_K_M   # default, good balance
ollama pull qwen2.5:3b-instruct-q2_K     # smaller, faster, lower quality

# Set the context window to the minimum needed (saves RAM) and cap CPU threads
# (num_thread below the core count, 4 on the Pi 5, trades some speed for less heat;
# num_thread can also be passed per-request via the API options)
cat > Modelfile << 'EOF'
FROM qwen2.5:3b
PARAMETER num_ctx 2048
PARAMETER num_thread 3
EOF
ollama create qwen3b-small -f Modelfile

Practical Use Cases on Raspberry Pi

Offline home assistant: A small model running on a Pi can answer questions, control smart home devices via scripts, and process voice commands with Whisper — entirely without internet. Useful for privacy-conscious setups or homes with unreliable connectivity.
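
A minimal sketch of the assistant side, assuming speech has already been transcribed (for example by Whisper) and using the official ollama Python client; the intent list and model choice are illustrative:

import ollama

INTENTS = ['lights_on', 'lights_off', 'weather', 'timer', 'unknown']

def classify_command(transcript: str) -> str:
    # Ask the local model to map a transcribed command to one known intent
    response = ollama.chat(
        model='qwen2.5:1.5b',
        messages=[{
            'role': 'user',
            'content': f"Classify this command as one of {INTENTS}. "
                       f"Reply with the intent name only.\nCommand: {transcript}"
        }]
    )
    answer = response['message']['content'].strip().lower()
    return answer if answer in INTENTS else 'unknown'

print(classify_command('turn the living room lights off'))   # expected: lights_off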

Local RAG for documents: nomic-embed-text runs well on a Pi 5. Combine it with a small chat model and a simple vector store to build a private document assistant that runs on low-power hardware 24/7 without the cost of a full GPU server.
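
A minimal RAG sketch using the embeddings call in the official ollama Python client and a plain in-memory list in place of a real vector store; the documents are placeholders:

import ollama

docs = [
    'The boiler service is booked for 14 March.',
    'The wifi password for the guest network is on the fridge.',
    'Bin collection alternates weekly between recycling and garden waste.',
]

def embed(text: str) -> list[float]:
    return ollama.embeddings(model='nomic-embed-text', prompt=text)['embedding']

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm

doc_vectors = [embed(d) for d in docs]   # embed once, reuse for every query

def answer(question: str) -> str:
    q = embed(question)
    best = max(range(len(docs)), key=lambda i: cosine(q, doc_vectors[i]))
    response = ollama.chat(
        model='qwen2.5:3b',
        messages=[{'role': 'user',
                   'content': f"Context: {docs[best]}\n\nQuestion: {question}"}]
    )
    return response['message']['content']

print(answer('When is the boiler being serviced?'))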

Edge AI for IoT: Scripts that run on the Pi can call local Ollama for classification, anomaly detection, or natural language processing of sensor data — keeping all processing local with no cloud dependency and sub-10ms network latency to the LLM.
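
A sketch of the IoT pattern, assuming sensor readings arrive as simple dictionaries; the "ask the model for an assessment" approach and the model name are illustrative:

import json
import ollama

def describe_reading(reading: dict) -> str:
    # Ask the local model for a one-line plain-language assessment of a sensor reading
    response = ollama.chat(
        model='qwen2.5:1.5b',
        messages=[{
            'role': 'user',
            'content': 'In one sentence, say whether this greenhouse reading looks '
                       'normal or needs attention, and why: ' + json.dumps(reading)
        }]
    )
    return response['message']['content']

print(describe_reading({'temperature_c': 41.5, 'humidity_pct': 18, 'soil_moisture': 'low'}))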

Teaching and prototyping: A Pi with Ollama is an affordable, self-contained AI development environment. Students and developers can experiment with local LLMs without GPU hardware, learning the API and building integrations before scaling to more capable hardware.

Realistic Expectations

A Raspberry Pi 5 running a 3B model generates text at roughly 3–8 tokens per second — around 2–6 words per second, since an English word averages a little over one token. This is noticeably slow for interactive chat (a 100-word response takes roughly 15–45 seconds) but perfectly adequate for batch processing, offline scripting, and use cases where response time is less critical than cost and power consumption. The Pi 5 draws about 5–8 watts during inference, compared to 150–350 watts for a desktop GPU under load. For 24/7 operation over a year, this difference amounts to hundreds of kilowatt-hours, which matters for both energy cost and heat output. For the right use cases — lightweight, always-on, offline AI that does not need to be fast — the Pi is a genuinely compelling deployment target.
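
The annual energy figures are easy to check. A back-of-the-envelope calculation with illustrative wattages (actual draw depends on the machine and how often it is under load):

# Back-of-the-envelope annual energy comparison (illustrative wattages, not measurements)
HOURS_PER_YEAR = 24 * 365

pi_watts = 6          # Pi 5, mostly idle with occasional inference
desktop_watts = 60    # desktop GPU machine idling 24/7, excluding inference bursts

pi_kwh = pi_watts * HOURS_PER_YEAR / 1000
desktop_kwh = desktop_watts * HOURS_PER_YEAR / 1000

print(f"Pi 5:       {pi_kwh:.0f} kWh/year")                  # ~53 kWh
print(f"Desktop:    {desktop_kwh:.0f} kWh/year")             # ~526 kWh
print(f"Difference: {desktop_kwh - pi_kwh:.0f} kWh/year")    # hundreds of kWh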

Why ARM and Local AI Are a Natural Fit

The rise of capable small language models (1B–3B parameter range) has opened a deployment tier that simply did not exist two years ago. In 2023, a useful local LLM required at minimum 7–8B parameters and 6–8GB of VRAM — far beyond what embedded ARM hardware could provide. By 2026, models in the 1B–3B range from Qwen, Gemma, and Llama achieve quality that is genuinely useful for many tasks, not merely as a proof of concept. Qwen2.5 3B, for instance, produces coherent multi-turn conversations, handles basic coding tasks, and generates useful summaries — on a Raspberry Pi 5 with 8GB RAM.

The combination of improved model efficiency and Ollama’s clean API means that the same Python or JavaScript code you run against a 7B model on a desktop GPU runs unchanged against a 3B model on a Pi, just slower. This portability makes ARM a natural development and deployment target: prototype on more capable hardware, deploy to the Pi for production use cases where cost and power consumption matter more than speed.
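
In practice, the endpoint and model can be the only things that change between a desktop and a Pi deployment. A sketch using the Client class from the ollama Python client; the hostnames and model names are placeholders:

import ollama

# Same application code; only the endpoint and model differ per environment
TARGETS = {
    'desktop': {'host': 'http://localhost:11434',        'model': 'qwen2.5:7b'},
    'pi':      {'host': 'http://raspberrypi.local:11434', 'model': 'qwen2.5:3b'},
}

def summarise(text: str, target: str = 'pi') -> str:
    cfg = TARGETS[target]
    client = ollama.Client(host=cfg['host'])
    response = client.chat(
        model=cfg['model'],
        messages=[{'role': 'user', 'content': f'Summarise in two sentences:\n{text}'}]
    )
    return response['message']['content']

print(summarise('Ollama exposes the same API on every architecture it supports.'))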

Setting Up a Headless Pi AI Server

For a Pi that runs as a dedicated AI server — accepting requests from other devices on your local network — the setup is straightforward:

# 1. Install Ollama (runs as systemd service automatically)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Configure to listen on all interfaces
sudo systemctl edit ollama
# Add:
# [Service]
# Environment="OLLAMA_HOST=0.0.0.0:11434"
# Environment="OLLAMA_KEEP_ALIVE=24h"

sudo systemctl daemon-reload && sudo systemctl restart ollama

# 3. Pull models
ollama pull qwen2.5:3b
ollama pull nomic-embed-text

# 4. Test from another device on your network
curl http://PI_IP_ADDRESS:11434/api/tags

With this setup, any device on your local network (phone, laptop, desktop) can call the Pi’s Ollama API. This is useful for running a central AI server on low-power hardware that stays on 24/7, while client devices that may be turned off or away from the network call it on demand. The Pi’s 5–8W draw makes it economical to leave running continuously.
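
Any HTTP client works. Here is a sketch from another machine on the network using Python's requests library against the standard /api/generate endpoint; replace the IP address with your Pi's:

import requests   # pip install requests

PI_URL = 'http://192.168.1.50:11434'   # replace with your Pi's address

def ask_pi(prompt: str, model: str = 'qwen2.5:3b') -> str:
    # Non-streaming request to the Pi's Ollama API
    r = requests.post(
        f'{PI_URL}/api/generate',
        json={'model': model, 'prompt': prompt, 'stream': False},
        timeout=120,   # generous timeout; the Pi generates slowly
    )
    r.raise_for_status()
    return r.json()['response']

print(ask_pi('Give me one fact about ARM processors.'))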

Running Open WebUI on a Raspberry Pi

Open WebUI runs on a Pi 5 (most easily via Docker) and is noticeably slower to load than on desktop hardware. For a more lightweight option, consider calling Ollama's API directly or using a simpler chat frontend. If you do run Open WebUI on the Pi:

# Ensure Docker is installed
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER   # log out and back in for the group change to take effect

# Run Open WebUI (the arm64 image)
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --name open-webui \
  --restart unless-stopped \
  ghcr.io/open-webui/open-webui:main

The Open WebUI image is multi-architecture and runs on arm64 without modification. Expect slower initial load times compared to desktop hardware, but once loaded the interface works normally for the actual AI interaction (which is handled by Ollama, not Open WebUI).

Monitoring Temperature and Throttling

The Raspberry Pi 5 throttles CPU clock speed when it reaches high temperatures — typically above 80°C. During sustained LLM inference, which keeps the CPU at high utilisation, a Pi without active cooling can throttle and reduce inference speed by 20–40%. A heatsink case or active cooling fan is strongly recommended for any deployment that runs inference continuously:

# Monitor temperature
watch -n 2 vcgencmd measure_temp

# Check for throttling events
vcgencmd get_throttled
# 0x0 = no throttling, non-zero = throttling occurred

# Check current CPU frequency
cpu=$(vcgencmd get_clock arm | cut -d= -f2)   # output is "frequency(N)=<Hz>", keep the number
echo "CPU: $((cpu / 1000000)) MHz"

If you see sustained throttling, improve cooling before deploying for production use — throttling under load means your actual production performance will be worse than your benchmarks, which is a frustrating discovery after deployment. The official Raspberry Pi active cooler (a fan-heatsink combo) costs about $5 and completely eliminates thermal throttling under sustained load.
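
For unattended deployments it can be worth logging temperature alongside your workload. A small sketch that reads the kernel's thermal zone directly (the path is standard on Raspberry Pi OS; the 75°C warning threshold is an arbitrary choice, chosen to sit below the Pi 5's ~80°C throttle point):

import time

THERMAL_FILE = '/sys/class/thermal/thermal_zone0/temp'   # millidegrees Celsius
WARN_AT_C = 75.0

def cpu_temp_c() -> float:
    with open(THERMAL_FILE) as f:
        return int(f.read().strip()) / 1000

# Log every 10 seconds; stop with Ctrl+C
while True:
    temp = cpu_temp_c()
    flag = '  <-- approaching throttle range' if temp >= WARN_AT_C else ''
    print(f'{time.strftime("%H:%M:%S")}  {temp:.1f}°C{flag}')
    time.sleep(10)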

Benchmarking Your Setup

import ollama
import time

def benchmark(model: str, prompt: str = 'Write a haiku about Raspberry Pi.') -> dict:
    start = time.time()
    response = ollama.chat(
        model=model,
        messages=[{'role':'user','content':prompt}],
        stream=False
    )
    elapsed = time.time() - start
    tokens = response.get('eval_count', 0)
    return {
        'model': model,
        'tokens': tokens,
        'seconds': round(elapsed, 2),
        'tok_per_sec': round(tokens / elapsed, 1)
    }

# Note: elapsed time includes model load and prompt processing, so run the
# script twice and treat the second pass as your baseline.
for model in ['qwen2.5:1.5b', 'qwen2.5:3b', 'llama3.2:1b']:
    try:
        result = benchmark(model)
        print(f"{result['model']:25} {result['tok_per_sec']:5.1f} tok/s ({result['tokens']} tokens in {result['seconds']}s)")
    except Exception as e:
        print(f"{model}: not available ({e})")

Run this after setup to establish your Pi’s actual performance baseline. Results vary significantly between Pi generations, available RAM, cooling quality, and the specific model and quantisation. Your measured numbers are more reliable than any published benchmark for predicting how your specific setup will perform in production.

Comparing ARM Tiers: Pi 4 vs Pi 5 vs Jetson

If you are choosing between ARM devices for a local AI deployment, the differences are significant. The Raspberry Pi 4 (8GB) is adequate for light inference use — occasional queries, batch processing that runs overnight — but its slower CPU and memory bandwidth make it noticeably sluggish compared to the Pi 5. For new deployments, the Pi 5 is strongly preferred. The price difference is modest and the performance improvement (roughly 2–3x in inference speed) is meaningful for interactive use.

The NVIDIA Jetson Orin Nano (8GB) is in a different class — it has a dedicated GPU with CUDA support, allowing Ollama to run with GPU acceleration. A 7B model on a Jetson Orin runs at 30–60 tokens per second, which is genuinely fast enough for interactive use. The Jetson costs several times as much as a Raspberry Pi, but if you need GPU-class performance in an embedded form factor for production use, it is the right tool. For hobbyist use and experimentation, the Pi 5 hits the right price-performance point.

Ollama on ARM: The Future Direction

ARM’s share of the computing landscape is growing rapidly — not just in embedded devices but in cloud instances (AWS Graviton, Azure Ampere), development machines (Apple Silicon Macs), and enterprise servers. Ollama’s support for ARM64 means local AI development is not siloed to x86 hardware. Code you write and test on a Raspberry Pi runs on an M3 MacBook, an AWS Graviton instance, and an x86 desktop without modification. This cross-architecture consistency is one of Ollama’s less celebrated but practically important qualities — it removes architecture from the list of variables you need to manage when deploying local AI applications across different environments. The trend toward more powerful ARM devices (the Pi 5 is significantly faster than the Pi 4; the next generation will be faster still) suggests that ARM edge AI deployments will become increasingly practical as model quality at small parameter counts continues to improve.

When to Move Beyond the Pi

The Raspberry Pi is the right deployment target when power consumption and cost are priorities and inference speed is not critical. It is the wrong target when you need fast interactive responses, large context windows, multimodal capabilities, or models larger than 3B parameters. The natural upgrade path is an Apple Silicon Mac with 16–32GB unified memory (which handles 7–13B models at GPU speeds) or a machine with a discrete NVIDIA GPU. Think of the Pi as the right tool for edge AI use cases — always-on, low-power, offline — and desktop/server hardware as the right tool for interactive, high-quality, or computationally intensive use cases. Understanding where the Pi fits in this spectrum prevents both under-using it (dismissing it because it cannot run 70B models) and over-deploying it (building latency-sensitive applications on hardware that will frustrate users with slow responses). The sweet spot is offline, batch, or non-interactive AI tasks where the Pi’s efficiency and always-on economics genuinely justify the speed trade-off compared to faster hardware that costs ten times as much to run continuously.

Getting Started

The fastest path to a running Pi AI server: flash a fresh Raspberry Pi OS Lite (64-bit) to an SD card or NVMe drive, boot the Pi with SSH enabled, install Ollama with the one-line installer, pull qwen2.5:3b and nomic-embed-text, and add a basic Python script that calls the API (a minimal example follows below). With fast storage (an NVMe drive on the Pi 5 is strongly recommended over an SD card, since it significantly reduces model load times), you can have a working headless AI server in under an hour. A Pi 5 with NVMe storage, good cooling, and 8GB RAM is a capable edge AI platform that costs around $100 in hardware and about $5 per year in electricity to run continuously — a compelling package for the right use cases. As ARM devices continue to improve and small models keep closing the quality gap with larger ones, the case for edge AI on devices like the Raspberry Pi will only grow stronger, and Ollama's ARM support means you can take advantage of those improvements without changing your application code. That forward compatibility is a genuine advantage worth considering when choosing a software stack for any embedded AI project today.
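
The "basic Python script" can be as small as this, assuming the client library is installed with pip install ollama and a model has already been pulled:

import ollama

response = ollama.chat(
    model='qwen2.5:3b',
    messages=[{'role': 'user', 'content': 'Say hello from the Raspberry Pi.'}]
)
print(response['message']['content'])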
