How to Run Ollama as a Linux Service with systemd

Running Ollama manually with ollama serve works fine during development, but on a server you want Ollama to start automatically at boot, restart on failure, and run as a background service without a persistent terminal session. systemd — the init system used by Ubuntu, Debian, Fedora, and most modern Linux distributions — handles all of this with a simple service unit file. This guide covers the complete setup for running Ollama as a systemd service, including GPU support, environment variable configuration, and monitoring.

Creating the systemd Service File

If you installed Ollama with the official install script (curl -fsSL https://ollama.com/install.sh | sh), the installer already creates a systemd service. Check if it is running:

systemctl status ollama
# Should show: active (running)

If you installed Ollama manually (downloaded the binary directly), create the service file yourself:

sudo nano /etc/systemd/system/ollama.service

[Unit]
Description=Ollama LLM Service
After=network-online.target
Wants=network-online.target

[Service]
Type=exec
User=ollama
Group=ollama
ExecStart=/usr/local/bin/ollama serve
Restart=always
RestartSec=3

# Environment variables
Environment="HOME=/usr/share/ollama"
Environment="OLLAMA_MODELS=/usr/share/ollama/.ollama/models"

[Install]
WantedBy=default.target

# Create a dedicated ollama user (recommended)
sudo useradd -r -s /bin/false -U -m -d /usr/share/ollama ollama

# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama

# Verify
sudo systemctl status ollama
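When hand-editing unit files, a quick structural check before daemon-reload catches missing directives early. systemd-analyze verify /etc/systemd/system/ollama.service is the proper tool for this; the sketch below is a minimal grep-based check (check_unit is a name chosen for this sketch, illustrated against a throwaway file rather than the real unit):

```shell
#!/bin/bash
# Minimal sanity check for a systemd unit file: are the key directives present?
check_unit() {
  local unit="$1"
  for key in '^\[Service\]' '^ExecStart=' '^Restart='; do
    grep -q "$key" "$unit" || { echo "missing: $key"; return 1; }
  done
  echo "unit looks OK"
}

# Illustrated against a throwaway file; on a real server, point it at
# /etc/systemd/system/ollama.service instead.
tmp=$(mktemp)
printf '[Unit]\nDescription=Ollama\n\n[Service]\nExecStart=/usr/local/bin/ollama serve\nRestart=always\n' > "$tmp"
check_unit "$tmp"   # prints: unit looks OK
```

This only checks that the directives exist; systemd-analyze verify additionally validates values and dependencies.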

Adding GPU Support (NVIDIA)

For NVIDIA GPU access from a systemd service, the service user needs access to the GPU devices. Add the user to the video and render groups:

sudo usermod -aG video ollama
sudo usermod -aG render ollama

If Ollama cannot find the CUDA libraries, also point the service at them and, optionally, pin which GPU it may use by adding environment variables to the service file:

[Service]
# ... existing config ...
Environment="LD_LIBRARY_PATH=/usr/local/lib:/usr/lib"
# Use GPU 0 only; remove this line to use all GPUs
Environment="CUDA_VISIBLE_DEVICES=0"

sudo systemctl daemon-reload
sudo systemctl restart ollama

# Verify GPU is detected
sudo journalctl -u ollama -n 20 | grep -i cuda
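A quick way to confirm the group changes took effect is id -nG ollama; the helper below (has_gpu_groups is a hypothetical name, not part of Ollama) checks that output for both GPU groups:

```shell
#!/bin/bash
# Check that a group list (as printed by `id -nG <user>`) contains both
# groups needed for GPU device access.
has_gpu_groups() {
  local groups=" $1 "
  case "$groups" in *" video "*) ;; *) return 1 ;; esac
  case "$groups" in *" render "*) ;; *) return 1 ;; esac
  return 0
}

# usage (on the server):
#   has_gpu_groups "$(id -nG ollama)" && echo "GPU groups OK" || echo "missing groups"
has_gpu_groups "ollama video render" && echo "GPU groups OK"   # prints: GPU groups OK
```

Note that supplementary groups are applied when the service process starts, so restart Ollama after changing group membership.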

Key Environment Variables

Customise Ollama’s behaviour by adding Environment= lines to the [Service] section:

[Service]
# Listen on all interfaces (for remote access)
Environment="OLLAMA_HOST=0.0.0.0:11434"
# Custom model storage path
Environment="OLLAMA_MODELS=/data/ollama/models"
# Keep models loaded for 1 hour after last use
Environment="OLLAMA_KEEP_ALIVE=1h"
# Allow 2 models in memory simultaneously
Environment="OLLAMA_MAX_LOADED_MODELS=2"
# Allow 2 concurrent requests
Environment="OLLAMA_NUM_PARALLEL=2"
# Enable flash attention (faster on compatible GPUs)
Environment="OLLAMA_FLASH_ATTENTION=1"

After editing the service file, always reload systemd and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart ollama
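As an alternative to editing the installed unit directly, a systemd drop-in override keeps your changes in a separate file, so a reinstall that rewrites ollama.service does not lose them. The override path and directives below are illustrative; add only the settings you want to change:

```shell
# Opens an editor on /etc/systemd/system/ollama.service.d/override.conf
sudo systemctl edit ollama
# In the editor, add only the directives you want to override, e.g.:
#
#   [Service]
#   Environment="OLLAMA_KEEP_ALIVE=1h"
#
# systemctl edit reloads the daemon on save; then restart the service:
sudo systemctl restart ollama
```

systemctl cat ollama shows the merged result of the base unit plus all drop-ins.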

Managing the Service

# Start / stop / restart
sudo systemctl start ollama
sudo systemctl stop ollama
sudo systemctl restart ollama

# Enable / disable autostart at boot
sudo systemctl enable ollama
sudo systemctl disable ollama

# Check status
sudo systemctl status ollama

# View logs (live)
sudo journalctl -u ollama -f

# View last 50 log lines
sudo journalctl -u ollama -n 50

# View logs since last boot
sudo journalctl -u ollama -b

# View logs with timestamps
sudo journalctl -u ollama --since '1 hour ago'

Pulling Models at Service Start

To automatically pull required models when the service starts, create a separate oneshot service that runs after Ollama is ready:

sudo nano /etc/systemd/system/ollama-setup.service

[Unit]
Description=Pull required Ollama models
After=ollama.service
Requires=ollama.service

[Service]
Type=oneshot
User=ollama
ExecStart=/bin/bash -c 'until curl -sf http://localhost:11434/ > /dev/null; do sleep 1; done; /usr/local/bin/ollama pull llama3.2; /usr/local/bin/ollama pull nomic-embed-text'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload
sudo systemctl enable ollama-setup
# Models will be pulled on next boot (or run manually with: sudo systemctl start ollama-setup)
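The until-loop in the ExecStart line can be factored into a reusable wait helper with a timeout, so a misconfigured Ollama fails the setup unit instead of blocking boot forever. wait_for is a hypothetical name for this sketch:

```shell
#!/bin/bash
# Wait for an HTTP endpoint to answer, giving up after a timeout (in seconds).
wait_for() {
  local url="$1" timeout="${2:-60}" waited=0
  until curl -sf "$url" > /dev/null 2>&1; do
    waited=$((waited + 1))
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
  done
}

# usage: wait_for http://localhost:11434/ 120 && ollama pull llama3.2
```

With a timeout in place, a failed pull shows up as a failed ollama-setup unit in systemctl status rather than a silently hung boot.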

Storing Models on a Separate Drive

Model files are large (4–70GB each). On servers with a small OS drive and a separate data drive, store models on the larger partition:

# Create directory on data drive
sudo mkdir -p /data/ollama/models
sudo chown -R ollama:ollama /data/ollama

# Set in service file
Environment="OLLAMA_MODELS=/data/ollama/models"

Checking Service Health

# Is Ollama responding?
curl -sf http://localhost:11434/ > /dev/null && echo 'OK' || echo 'DOWN'

# Which models are loaded in memory?
curl -s http://localhost:11434/api/ps | python3 -m json.tool

# List available models
curl -s http://localhost:11434/api/tags | python3 -c "
import json, sys
for m in json.load(sys.stdin)['models']:
    print(f\"{m['name']:45} {m['size']/1e9:.1f}GB\")"

Troubleshooting Common Issues

Service starts but Ollama not responding: check the logs with journalctl -u ollama -n 50 and look for port conflicts (another process on 11434) or permission errors.

GPU not detected: verify the ollama user is in the video and render groups (id ollama) and that the NVIDIA drivers are loaded (nvidia-smi).

Models not found after restart: check that OLLAMA_MODELS points to the correct path and that the ollama user has read/write permissions on that directory.

Service fails to start: run sudo -u ollama /usr/local/bin/ollama serve manually to see the error output directly, without systemd’s logging layer.

Why Run Ollama as a Service

Running Ollama manually with ollama serve in a terminal session has two practical problems for server deployments. First, the process dies when the terminal session ends — SSH logout, connection drop, or accidental terminal close all terminate Ollama, which means any applications depending on it stop working until someone logs back in and restarts it manually. Second, it does not start automatically when the server reboots — planned maintenance, power events, or kernel updates all require manual intervention to restore the service. A systemd service eliminates both problems: Ollama starts at boot, runs in the background with no terminal session required, and restarts automatically if it crashes.

The Restart=always and RestartSec=3 lines in the service file are particularly important for reliability. Restart=always tells systemd to restart Ollama regardless of how it exited — whether it crashed with a non-zero exit code, was killed by the OOM killer due to memory pressure, or exited cleanly. RestartSec=3 adds a 3-second delay before each restart attempt, which prevents a fast restart loop if there is a persistent startup failure (such as a missing file or invalid configuration). Together these settings give Ollama the same reliability profile as a production web server — it will recover from most transient failures automatically without human intervention.

The Dedicated User Pattern

Creating a dedicated ollama system user rather than running as root or your personal account follows the principle of least privilege. The ollama user has access to only the resources it needs: the binary, the model storage directory, and the GPU devices. It cannot read your home directory, modify system files, or perform other actions unrelated to running inference. This matters for server security — if a vulnerability in Ollama were ever discovered that allowed code execution, the impact would be limited to what the ollama user can access rather than the full permissions of a root or admin user.

The system user created by useradd -r is a service account: it has no login shell (-s /bin/false), no interactive login capability, and a home directory at /usr/share/ollama that exists only to hold Ollama’s data. This is the standard pattern for service accounts on Linux — the same approach used for nginx, postgresql, redis, and other system services. It integrates naturally with Linux’s permission model and makes the Ollama deployment consistent with other services on the same server.

Configuring for Remote Access

By default, Ollama listens only on localhost (127.0.0.1:11434), which means it is only accessible from the same machine. For a team server where multiple developers connect remotely, set OLLAMA_HOST=0.0.0.0:11434 to listen on all network interfaces. Important security consideration: this makes Ollama accessible to anyone who can reach your server’s IP address on port 11434, with no authentication. For internal network deployments (office LAN, VPN-accessible server, private cloud instance), this is acceptable if your network controls access appropriately. For public-facing servers, either firewall port 11434 to specific IP ranges, or add a reverse proxy with authentication in front of Ollama rather than exposing the raw API directly.

# Check what Ollama is currently listening on
ss -tlnp | grep 11434
# or
netstat -tlnp | grep 11434

# Test remote access from another machine
curl http://YOUR_SERVER_IP:11434/
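If you use ufw (Ubuntu's default firewall frontend), restricting port 11434 to a trusted subnet looks like the following. The 192.168.1.0/24 subnet is a placeholder for your own network, and these rules are illustrative, not a complete firewall policy:

```shell
# Allow the Ollama port only from a trusted subnet (adjust 192.168.1.0/24)
sudo ufw allow from 192.168.1.0/24 to any port 11434 proto tcp
# Deny the port from everywhere else
sudo ufw deny 11434/tcp
# Review the resulting rules
sudo ufw status numbered
```

ufw evaluates rules in order, so the more specific allow rule must be added before the blanket deny.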

Monitoring the Service with Basic Scripting

For servers where you want simple monitoring without a full Prometheus/Grafana stack, a shell script that checks Ollama’s health and sends an alert if it is down is sufficient for most small deployments:

#!/bin/bash
# /usr/local/bin/check_ollama.sh
if ! curl -sf http://localhost:11434/ > /dev/null 2>&1; then
  echo "Ollama is DOWN at $(date)" | mail -s "Ollama Alert" admin@example.com
  # Or use a webhook: curl -X POST https://hooks.slack.com/... -d '{"text":"Ollama is down"}'
fi
# Add to cron for every 5 minutes
crontab -e
# Add: */5 * * * * /usr/local/bin/check_ollama.sh

For more sophisticated monitoring, Ollama exposes metrics-friendly endpoints: /api/ps for currently loaded models and VRAM usage, and / for a basic health check. A simple monitoring script can poll these endpoints, write the results to a log file, and alert on anomalies — loaded model count dropping to zero, response time increasing above a threshold, or repeated restart events in the systemd journal.
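As a building block for such a script, the loaded-model count can be extracted from the /api/ps response with a short Python one-liner. loaded_count is a name chosen for this sketch, and the sample payload stands in for a live response:

```shell
#!/bin/bash
# Count the models currently loaded, given an /api/ps JSON payload on stdin.
loaded_count() {
  python3 -c 'import json, sys; print(len(json.load(sys.stdin).get("models", [])))'
}

# On a live server:  curl -s http://localhost:11434/api/ps | loaded_count
# With a sample payload standing in for the API response:
echo '{"models":[{"name":"llama3.2:latest"}]}' | loaded_count   # prints: 1
```

A cron job can then compare the count against an expected minimum and alert when it drops to zero.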

Log Rotation

systemd’s journal handles log rotation automatically — old logs are pruned according to the size and time limits configured in /etc/systemd/journald.conf. By default the journal is capped at 10% of its filesystem (up to 4 GB), and no time-based limit applies unless you set MaxRetentionSec. For Ollama deployments that process many requests, this is usually sufficient. If you need longer retention or want to feed a centralised logging system, use journald’s forwarding options, or add StandardOutput=append:/var/log/ollama.log to the [Service] section to write the service’s output to a file instead of the journal.

Upgrading Ollama

Upgrading Ollama when running as a systemd service is straightforward because models are stored separately from the binary:

# Stop the service
sudo systemctl stop ollama

# Download and install the new binary
curl -fsSL https://ollama.com/install.sh | sudo sh

# Start the service again (models are unchanged)
sudo systemctl start ollama
sudo systemctl status ollama

Model files in the OLLAMA_MODELS directory are not touched by the installer — they persist across upgrades. The service file from a manual setup also persists, though it is worth reviewing the latest Ollama documentation for any new environment variables or configuration options introduced in the upgraded version.

Running Multiple Ollama Instances

For advanced setups — serving different model families on different GPUs, or running a testing instance alongside a production instance — you can run multiple Ollama services with different configurations. Copy the service file with a different name and override the port and GPU assignment:

sudo cp /etc/systemd/system/ollama.service /etc/systemd/system/ollama-gpu1.service
sudo nano /etc/systemd/system/ollama-gpu1.service

[Service]
# Different port for this instance
Environment="OLLAMA_HOST=0.0.0.0:11435"
# Pin this instance to the second GPU
Environment="CUDA_VISIBLE_DEVICES=1"
Environment="OLLAMA_MODELS=/data/ollama-gpu1/models"

sudo systemctl daemon-reload
sudo systemctl enable ollama-gpu1
sudo systemctl start ollama-gpu1

Each instance runs independently on its own port with its own model storage. Applications can target different instances for different workloads — for example, routing coding requests to the instance running a code-specific model on one GPU and general chat requests to the instance running a general model on another GPU.
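The ollama CLI honours OLLAMA_HOST, so individual commands can target a specific instance. A small dispatch helper makes the routing explicit in scripts; instance_for is a hypothetical name, and the port assignments match the example above:

```shell
#!/bin/bash
# Map a workload type to the instance that serves it (ports from the example above).
instance_for() {
  case "$1" in
    code) echo "127.0.0.1:11435" ;;  # second instance, GPU 1
    *)    echo "127.0.0.1:11434" ;;  # default instance, GPU 0
  esac
}

# usage: OLLAMA_HOST=$(instance_for code) ollama run <model> "..."
instance_for code   # prints: 127.0.0.1:11435
```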

The Value of a Properly Configured Service

The difference between Ollama running in a terminal and Ollama running as a properly configured systemd service is the difference between a development setup and a production deployment. The service setup described in this article provides automatic startup, crash recovery, structured logging, GPU access, clean process isolation, and an upgrade path — the same properties you would expect from any production service. The configuration is not complex — a few dozen lines in a service file — but it transforms Ollama from a tool you run when you need it into infrastructure that reliably serves applications and team members without manual intervention. For any deployment beyond personal development use, this setup is the right foundation to build on.

Comparing Service Deployment Options

systemd is not the only way to run Ollama as a background service on Linux. Docker Compose is an excellent alternative for deployments where container isolation, easy updates via image pulls, and integration with other containerised services matter more than a native installation. The choice typically comes down to your server’s existing infrastructure: if you already manage services with systemd (nginx, postgresql, redis), a native systemd Ollama service fits naturally into that pattern and uses the same tools you already know for status checks, log viewing, and service management. If you already use Docker Compose for other services, the Docker Compose approach from the Ollama Docker article on this blog gives you consistency with your existing container infrastructure and the same update workflow you use for other images.

For bare-metal servers dedicated primarily to running Ollama — a home AI server or a team inference machine — the native systemd approach is typically preferable because it eliminates the Docker overhead, gives better GPU performance (no container runtime between the GPU drivers and the inference process), and integrates directly with the operating system’s service management. For cloud VMs or servers running mixed workloads where isolation and portability matter, Docker Compose is the more flexible option. Both approaches provide the same core benefit — Ollama runs reliably in the background with automatic restart — but their operational characteristics differ in ways that matter depending on your deployment context.

Getting Started

If the official Ollama installer has already created a systemd service on your machine, the setup is already complete — verify with systemctl status ollama and customise the environment variables as needed. If you installed manually, the service file in this article takes about five minutes to create and configure, and the result is an Ollama instance that starts automatically, recovers from failures, and integrates with your standard Linux service management tools. The one-time investment in proper service configuration pays back immediately and compounds over time as the server runs without requiring manual restarts or monitoring. The service unit patterns in this article also transfer directly to other local AI tools you might run alongside Ollama — a Tabby coding server, a Whisper transcription service, or a custom inference backend can all be managed with the same systemd approach, giving you a consistent operational pattern for your entire local AI infrastructure stack. Once you have the pattern down for one service, replicating it for others takes minutes rather than starting from scratch each time — and your server becomes more reliable with each service you bring under proper systemd management.
