Ollama Keep-Alive and Model Preloading: Eliminate Cold Start Latency

By default Ollama unloads a model from memory 5 minutes after the last request. The next request then pays a cold-start penalty — typically 3–10 seconds for a 7B model — while the weights reload from disk. For interactive applications this latency is noticeable. Understanding Ollama’s keep-alive setting, how to pre-load models, and how to manage multiple models in memory simultaneously eliminates this problem entirely.

How Keep-Alive Works

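Keep-alive controls how long Ollama holds a model in memory (GPU VRAM, or system RAM when running on CPU) after the last request. The default is 5 minutes (5m). When that timer expires, the model is unloaded and its memory is freed, so the next request triggers a cold start. The value can be a duration string such as 30m or 1h, a number of seconds, 0 to unload immediately, or a negative number to keep the model loaded indefinitely. Keep-alive can be set per request or globally via an environment variable.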
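
To make the accepted values concrete, here is a minimal sketch that converts a keep-alive value into seconds. The function name keep_alive_seconds is hypothetical, and the parsing only approximates the documented behaviour (duration strings, plain seconds, negative for indefinite); it is not Ollama's own parser.

```python
import re

# Units follow Go-style duration strings such as "30m" or "1h30m".
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|s|m|h)")


def keep_alive_seconds(value):
    """Return the keep-alive in seconds, or None for 'stay loaded indefinitely'.

    Illustrative only: mirrors the documented semantics, not Ollama's parser.
    """
    if isinstance(value, (int, float)):
        # Bare numbers are seconds; any negative number means indefinite.
        return None if value < 0 else float(value)
    text = value.strip()
    negative = text.startswith("-")
    text = text.lstrip("+-")
    if re.fullmatch(r"\d+(?:\.\d+)?", text):
        # Numeric string such as "-1" or "300", as used in OLLAMA_KEEP_ALIVE.
        return None if negative else float(text)
    parts = _PART.findall(text)
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unrecognised keep_alive value: {value!r}")
    total = sum(float(n) * _UNIT_SECONDS[u] for n, u in parts)
    return None if negative else total
```

A result of None means the model stays loaded until it is unloaded manually or Ollama restarts; 0.0 means it is unloaded as soon as the request finishes.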

Setting Keep-Alive Per Request

Pass a keep_alive field in the request body to override the default. The value applies to the model named in the request and restarts its timer from that request:

# Keep model loaded for 30 minutes after this request
curl http://localhost:11434/api/chat \
  -d '{"model":"llama3.2","messages":[{"role":"user","content":"Hello"}],"keep_alive":"30m"}'

# Keep loaded indefinitely (until manually unloaded or Ollama restarts)
curl http://localhost:11434/api/chat \
  -d '{"model":"llama3.2","messages":[{"role":"user","content":"Hello"}],"keep_alive":-1}'

# Unload immediately after this request (free VRAM right away)
curl http://localhost:11434/api/chat \
  -d '{"model":"llama3.2","messages":[{"role":"user","content":"Hello"}],"keep_alive":0}'

import requests

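def chat_keep_alive(model, messages, keep_alive='30m'):
    """Send a non-streaming chat request and return the reply text."""
    r = requests.post('http://localhost:11434/api/chat',
        json={'model':model,'messages':messages,'stream':False,'keep_alive':keep_alive})
    r.raise_for_status()  # fail loudly on a missing model or server error
    return r.json()['message']['content']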

# Application startup: load model and keep it indefinitely
chat_keep_alive('llama3.2', [{'role':'user','content':' '}], keep_alive=-1)
print('Model pre-warmed and will stay loaded')

Setting Keep-Alive Globally

Set the OLLAMA_KEEP_ALIVE environment variable before starting Ollama to change the default for all models:

# Set globally — models stay loaded 1 hour by default
OLLAMA_KEEP_ALIVE=1h ollama serve

# Or export it permanently in your shell profile
export OLLAMA_KEEP_ALIVE=1h

# Keep all models loaded indefinitely
OLLAMA_KEEP_ALIVE=-1 ollama serve

In Docker Compose, add it to the environment section:

services:
  ollama:
    image: ollama/ollama
    environment:
      - OLLAMA_KEEP_ALIVE=1h
      - OLLAMA_NUM_PARALLEL=2
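
If you launch Ollama from a Python supervisor rather than a shell or Compose file, the same variables can be passed through the process environment. This is a rough sketch: the helper name ollama_env and its parameters are made up for illustration, and the commented usage assumes the ollama binary is on PATH.

```python
import os


def ollama_env(keep_alive="1h", num_parallel=None, base=None):
    """Return a copy of the environment with Ollama settings applied.

    base defaults to the current process environment; pass a dict to override.
    """
    env = dict(os.environ if base is None else base)
    env["OLLAMA_KEEP_ALIVE"] = str(keep_alive)
    if num_parallel is not None:
        env["OLLAMA_NUM_PARALLEL"] = str(num_parallel)
    return env


# Usage (not executed here):
#   import subprocess
#   server = subprocess.Popen(["ollama", "serve"],
#                             env=ollama_env("1h", num_parallel=2))
```

Building the environment as a copy keeps the settings scoped to the server process instead of changing the supervisor's own environment.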

Pre-Loading Models at Startup

Pre-loading ensures a model is in memory and ready before the first real request arrives. Send a minimal request with keep_alive: -1 immediately after Ollama starts:

import requests
import time

OLLAMA = 'http://localhost:11434'

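def wait_for_ollama(timeout=30):
    """Poll the server root once per second until it responds or timeout expires."""
    for _ in range(timeout):
        try:
            requests.get(OLLAMA, timeout=1)
            return True
        except requests.RequestException:
            # Server not accepting connections yet; retry
            time.sleep(1)
    return False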

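def preload_model(model, keep_alive=-1, embedding=False):
    """Load model into memory. keep_alive=-1 means indefinite."""
    # Embedding models are loaded through the embeddings endpoint,
    # since they do not generate text
    endpoint = 'embeddings' if embedding else 'generate'
    r = requests.post(f'{OLLAMA}/api/{endpoint}',
        json={'model':model,'prompt':' ','stream':False,'keep_alive':keep_alive})
    r.raise_for_status()  # do not report success if the load failed
    print(f'{model} loaded and ready')

if wait_for_ollama():
    preload_model('llama3.2')
    preload_model('nomic-embed-text', embedding=True)
else:
    print('Ollama did not start in time')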
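
After pre-loading, it can be worth confirming that the models really are resident before routing traffic. The sketch below checks an /api/ps response for a list of required models. The helper names are hypothetical, and it assumes that a model requested without a tag is reported with the :latest tag. It uses only the standard library.

```python
import json
import urllib.request


def normalize(name):
    """Add the implicit :latest tag so 'llama3.2' matches 'llama3.2:latest'."""
    return name if ":" in name else f"{name}:latest"


def missing_models(ps_response, required):
    """Return the required models that are absent from an /api/ps response."""
    loaded = {normalize(m.get("name", "")) for m in ps_response.get("models", [])}
    return [name for name in required if normalize(name) not in loaded]


def fetch_ps(base="http://localhost:11434", timeout=5):
    """Fetch the list of currently loaded models from a running Ollama server."""
    with urllib.request.urlopen(f"{base}/api/ps", timeout=timeout) as resp:
        return json.load(resp)
```

A startup check could call missing_models(fetch_ps(), ['llama3.2', 'nomic-embed-text']) and refuse to start the application while the returned list is non-empty.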

Running Multiple Models Simultaneously

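Ollama can keep several models in memory at once, provided they fit. The limit is set by OLLAMA_MAX_LOADED_MODELS. Recent releases default to 3 models per GPU (or 3 for CPU inference), while older releases loaded one model at a time and unloaded the previous model on every switch. Set the variable explicitly to make the behaviour predictable: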

# Allow up to 3 models in memory at once
export OLLAMA_MAX_LOADED_MODELS=3
ollama serve

# Docker Compose
environment:
  - OLLAMA_MAX_LOADED_MODELS=3
  - OLLAMA_KEEP_ALIVE=2h

With multiple loaded models, you can pre-load different models for different tasks and switch between them with zero cold-start latency. The constraint is total available VRAM — each loaded model consumes its full KV cache allocation even when idle. On a 24GB GPU, loading both a 7B chat model (~5GB) and a 7B embedding model (~5GB) simultaneously leaves 14GB for the OS and headroom, which is comfortable.

Inspecting What Is Currently Loaded

# See all loaded models, VRAM usage, and expiry time
curl http://localhost:11434/api/ps

import requests

def show_loaded_models(base='http://localhost:11434'):
    models = requests.get(f'{base}/api/ps').json().get('models',[])
    if not models:
        print('No models currently loaded')
        return
    for m in models:
        vram = m.get('size_vram',0)/1e9
        expires = m.get('expires_at','indefinite')[:19]
        print(f"{m['name']:45} {vram:.1f}GB  expires: {expires}")

show_loaded_models()

Manually Unloading a Model

To free VRAM immediately without waiting for the keep-alive timer, send a request with keep_alive: 0. Omitting the prompt unloads the model without generating anything first:

def unload_model(model):
    r = requests.post('http://localhost:11434/api/generate',
        json={'model':model,'keep_alive':0})
    r.raise_for_status()  # do not report success if the request failed
    print(f'{model} unloaded from memory')

unload_model('llama3.2')

Recommended Settings for Common Scenarios

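Interactive chat application: Set OLLAMA_KEEP_ALIVE=1h globally and pre-load your primary model at startup with keep_alive=-1. Users get instant responses and the model stays warm between conversations. Be aware that each request restarts the timer with its own keep_alive value, so a later request that omits the field is likely to bring the model back to the 1h global default. Pass keep_alive on every request, or set OLLAMA_KEEP_ALIVE=-1, if the model must never unload.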

Batch processing pipeline: Use the default 5m keep-alive. The first batch request pays the load cost; subsequent requests in the same batch are fast. Between batches the model unloads, freeing VRAM for other tasks.

Multi-model RAG pipeline: Set OLLAMA_MAX_LOADED_MODELS=2 and pre-load both the chat model and the embedding model with keep_alive=-1. Both stay in memory permanently, eliminating cold starts for either.

Low VRAM machine (8GB): Keep the default 5m keep-alive and only pre-load one model. Two 7B models (roughly 10GB combined) do not fit in 8GB of VRAM, so trying to keep both loaded either pushes layers onto the CPU or fails with out-of-memory errors. Let Ollama swap between them on demand instead.

Keep-Alive vs System RAM vs VRAM

Keep-alive holds the model wherever it was loaded: in GPU VRAM when it fits, otherwise split between VRAM and system RAM, with the overflow layers running on the CPU. When a new model is requested and there is not enough free memory, Ollama unloads an idle model to make room. Setting OLLAMA_MAX_LOADED_MODELS too high on a machine with limited VRAM leaves less room for each model, and a model that is partly offloaded to system RAM typically runs inference 5–10x slower. Check /api/ps and compare the size_vram field with size: if size_vram is much lower than the model’s total size, the model is partially offloaded to system RAM and you should reduce OLLAMA_MAX_LOADED_MODELS.
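
The size_vram check described above can be automated. The following is a small sketch that takes the parsed JSON from /api/ps and reports what share of each model sits in VRAM. The function name and the 5% tolerance are arbitrary choices for illustration, and it assumes each entry carries both size and size_vram in bytes.

```python
def offload_report(ps_response, tolerance=0.05):
    """Return (name, fraction_in_vram, fully_resident) for each loaded model.

    ps_response is the parsed JSON body of GET /api/ps.
    tolerance is an assumed margin below 100% that still counts as resident.
    """
    report = []
    for m in ps_response.get("models", []):
        size = m.get("size", 0)
        size_vram = m.get("size_vram", 0)
        # Guard against a zero or missing size to avoid division by zero.
        fraction = size_vram / size if size else 0.0
        report.append(
            (m.get("name", "?"), round(fraction, 2), fraction >= 1 - tolerance)
        )
    return report
```

Any entry whose last element is False is a candidate for unloading or for a lower OLLAMA_MAX_LOADED_MODELS.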

Why Cold Start Latency Matters More Than You Think

A 5–10 second cold start on first request sounds minor, but it has an outsized effect on user experience in interactive applications. If a user opens a chat interface and sends their first message, a 10-second wait before seeing any response feels broken, not slow. Research on web application UX consistently shows that users perceive waits above 3 seconds as failures rather than slowness. For developers building applications on top of Ollama — internal tools, chat interfaces, RAG pipelines — eliminating cold-start latency is one of the highest-return-on-investment optimisations available, and it costs nothing in hardware or quality.

The cold start also affects the first request after a period of inactivity, not just after Ollama restarts. If a user steps away for 10 minutes during a work session and then comes back to continue chatting, the model has unloaded and the next message pays the reload penalty. From the user’s perspective, the interface was fast and then suddenly became slow for no apparent reason — a confusing experience that erodes trust in the tool. Setting keep-alive to match typical session pause lengths (30 minutes to 1 hour for most interactive applications) makes this invisible.

The Load Duration Signal

Every non-streaming Ollama response, and the final chunk of a streamed one, includes a load_duration field in the performance statistics, reported in nanoseconds. This tells you how long the model took to load for that specific request: close to zero on warm requests, several seconds on cold starts. Logging this field in your application gives you visibility into when cold starts are occurring and how long they take, which is the foundation for deciding whether your current keep-alive setting is appropriate for your usage pattern.

import requests
from collections import defaultdict

COLD_START_SEC = 0.5   # loads slower than this count as cold starts
stats = defaultdict(list)

def chat_and_track(model, messages):
    r = requests.post('http://localhost:11434/api/chat',
        json={'model':model,'messages':messages,'stream':False})
    r.raise_for_status()
    d = r.json()
    load_sec = d.get('load_duration',0)/1e9   # nanoseconds to seconds
    stats[model].append(load_sec)
    if load_sec > COLD_START_SEC:
        print(f'WARNING: Cold start for {model} took {load_sec:.1f}s')
    return d['message']['content']

def cold_start_report():
    for model, times in stats.items():
        cold = [t for t in times if t > COLD_START_SEC]
        print(f'{model}: {len(cold)}/{len(times)} cold starts, '
              f'avg cold: {sum(cold)/max(len(cold),1):.1f}s')

chat_and_track('llama3.2',[{'role':'user','content':'Hello'}])
cold_start_report()

Tracking cold starts in production gives you the data to tune keep-alive settings for your specific usage pattern. If you see cold starts only on the very first daily request (morning login pattern), a keep-alive of 12–24 hours or indefinite loading makes sense. If cold starts are scattered throughout the day, the default 5 minutes is too short and a 1–2 hour keep-alive would eliminate most of them.
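
The tracker above uses non-streaming requests. Interactive applications usually stream, in which case the statistics arrive only in the final chunk (the one with done set to true). This sketch pulls load_duration out of a sequence of streamed JSON lines. The function name is hypothetical.

```python
import json


def load_seconds_from_stream(lines):
    """Return load_duration in seconds from the final chunk of a streamed reply.

    lines is any iterable of JSON lines (str or bytes), one chunk per line.
    Returns None if the stream ended without a final chunk.
    """
    for raw in lines:
        if not raw:
            continue  # skip blank lines between chunks
        chunk = json.loads(raw)
        if chunk.get("done"):
            # Ollama reports durations in nanoseconds.
            return chunk.get("load_duration", 0) / 1e9
    return None
```

With the requests library, the lines can come from the iter_lines() method of a response opened with stream=True.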

Keep-Alive with Multiple Applications

When multiple applications share a single Ollama instance — for example, a chat application and a RAG pipeline both pointing at the same Ollama server — the keep-alive settings from each application can conflict. If the chat application sets keep_alive=-1 for its model and the RAG pipeline also sets keep_alive=-1 for its embedding model, both stay loaded simultaneously. But if one application sets keep_alive=0 at the end of a session, it unloads its model even while the other application might still need it loaded. Design the keep-alive strategy at the infrastructure level (via OLLAMA_KEEP_ALIVE environment variable and OLLAMA_MAX_LOADED_MODELS) rather than relying on individual applications to coordinate their keep-alive settings — this gives you consistent, predictable model residency regardless of which application makes the last request.
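
One way to enforce an infrastructure-level policy is to route every application through a shared helper that drops per-request keep_alive values, so that only the server's environment settings decide residency. This is a sketch of the idea only. The function name and the allow_override flag are hypothetical.

```python
def apply_residency_policy(payload, allow_override=False):
    """Return a copy of a request payload with keep_alive removed unless allowed.

    The original payload is left untouched so callers can log what they asked for.
    """
    cleaned = dict(payload)
    if not allow_override:
        # Without keep_alive, the server falls back to OLLAMA_KEEP_ALIVE.
        cleaned.pop("keep_alive", None)
    return cleaned
```

A trusted maintenance job could pass allow_override=True to unload a model deliberately, while ordinary applications never change residency by accident.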

Memory Budget Planning

Planning which models to keep loaded requires knowing their memory footprint. As a practical guide: a 7B model at Q4_K_M quantisation uses approximately 4.5–5.5GB of VRAM for weights plus 0.5–2GB for the KV cache at typical context lengths. An embedding model like nomic-embed-text uses about 0.3GB. A 3B completion model like StarCoder2-3B uses about 2GB. On a 24GB GPU, you can comfortably keep a 7B chat model, an embedding model, and a 3B completion model loaded simultaneously with room for the KV cache and OS overhead — a complete local AI stack with zero cold starts for any component.

For the 16GB VRAM tier (for example an RTX 4080, or an M2 Pro with 16GB of unified memory), keep-alive strategy matters more because the memory budget is tighter. A 7B chat model (5GB) plus an embedding model (0.3GB) fits comfortably. Adding a second 7B model pushes the total to 10–11GB, which still fits on 16GB with room for KV cache at moderate context lengths. Attempting three 7B models simultaneously (roughly 15–16GB before KV cache) on 16GB will cause one to be partially offloaded to system RAM, degrading inference speed. Use /api/ps to compare size_vram with the total model size: if they differ significantly, the model is partly offloaded and you should reduce the number of simultaneously loaded models.
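
The budget arithmetic above is easy to script. The sketch below sums weights and KV cache per model and compares the total with the available VRAM. The function name and the 1.5GB reserve for the display and driver are assumptions, and the figures in the usage example are the rough estimates from this section, not measurements.

```python
def plan_vram(models, vram_gb, reserve_gb=1.5):
    """Return (total_gb, headroom_gb, fits) for a set of models.

    models maps a name to (weights_gb, kv_cache_gb).
    reserve_gb is an assumed allowance for the desktop, driver and fragmentation.
    """
    total = sum(weights + kv for weights, kv in models.values())
    headroom = vram_gb - reserve_gb - total
    return round(total, 1), round(headroom, 1), headroom >= 0


# Example: the 24GB stack described above, using the article's estimates.
stack = {
    "chat-7b": (5.0, 1.0),
    "embed": (0.3, 0.0),
    "code-3b": (2.0, 0.5),
}
total_gb, headroom_gb, fits = plan_vram(stack, vram_gb=24)
```

Running the same function with three 7B models and vram_gb=16 reports negative headroom, which matches the warning above about partial offload.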

Startup Scripts for Production Deployments

A reliable pattern for production Ollama deployments is a startup script that waits for Ollama to be healthy, then pre-loads all required models before accepting application traffic. This eliminates cold starts entirely from the user’s perspective — by the time the first real request arrives, all models are already warm.

#!/bin/bash
# startup.sh — run before starting your application
echo 'Waiting for Ollama...'
until curl -sf http://localhost:11434/ > /dev/null; do
  sleep 1
done
echo 'Ollama ready. Pre-loading models...'
curl -s -X POST http://localhost:11434/api/generate \
  -d '{"model":"llama3.2","prompt":" ","stream":false,"keep_alive":-1}' > /dev/null
curl -s -X POST http://localhost:11434/api/embeddings \
  -d '{"model":"nomic-embed-text","prompt":" ","keep_alive":-1}' > /dev/null
echo 'Models loaded. Starting application...'
exec "$@"

Run this script as the entrypoint of your Docker container or as a systemd pre-start hook, passing your application’s start command as arguments. The keep_alive: -1 in both requests keeps each model loaded indefinitely. The embeddings endpoint accepts keep_alive as well; if you omit it there, the OLLAMA_KEEP_ALIVE environment variable controls the embedding model’s residency.

Tuning Keep-Alive for Your Workflow

The right keep-alive value depends almost entirely on your usage pattern — there is no universal correct setting. For personal development use where you run Ollama on your own machine and use it throughout the workday, setting OLLAMA_KEEP_ALIVE=8h in your shell profile and adding your primary model to a startup script means it is always available during working hours without occupying VRAM overnight. For a shared team server running 24/7, indefinite keep-alive (-1) for frequently used models makes sense since the server is dedicated to Ollama and has no competing VRAM demands. For a laptop with limited VRAM that you also use for other tasks, the default 5 minutes is often appropriate — it balances availability for AI tasks with freeing VRAM for gaming, video editing, or other GPU-intensive work when you are not using Ollama.

The iterative approach works best: start with the default, track cold starts using the load_duration field for a week, identify the patterns (time-of-day gaps, multi-minute pauses between requests), and set keep-alive to cover those gaps. Most users end up at either 30 minutes (covers brief breaks without tying up VRAM for hours) or indefinite (for dedicated inference machines). The 5-minute default is a conservative choice that suits machines with little spare memory. On modern machines with 16GB+ VRAM, extending it significantly costs little and meaningfully improves the experience.
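
The gap analysis in that approach can be sketched in a few lines: take the timestamps of logged requests, compute the pauses between them, and choose a keep-alive long enough to cover most of those pauses. The function name, the 90% coverage target and the 120-minute cap are arbitrary assumptions for illustration, not recommendations from Ollama.

```python
import math


def suggest_keep_alive_minutes(request_times, coverage=0.9, cap_minutes=120):
    """Suggest a keep-alive in minutes from sorted request timestamps (seconds).

    Picks the smallest observed gap that covers the given share of pauses,
    never below the 5-minute default and never above cap_minutes.
    Returns None when there are fewer than two requests.
    """
    gaps = sorted(b - a for a, b in zip(request_times, request_times[1:]))
    if not gaps:
        return None
    index = min(len(gaps) - 1, math.ceil(coverage * len(gaps)) - 1)
    minutes = math.ceil(gaps[index] / 60)
    return min(max(minutes, 5), cap_minutes)
```

The result can be written into OLLAMA_KEEP_ALIVE as, for example, 30m. Gaps longer than the cap (overnight, weekends) are deliberately left as cold starts.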

Keep-Alive and Energy Efficiency

Keeping a model loaded in VRAM does not actively consume GPU compute — the GPU is idle between requests, only drawing power to maintain the memory state. The power difference between having a model loaded in VRAM versus unloaded is typically 5–15 watts on a discrete GPU, which is negligible for most use cases. The exception is power-constrained environments like laptops on battery, where minimising VRAM occupancy reduces total system power draw. On battery power, letting the default 5-minute eviction handle model unloading is a reasonable trade-off between responsiveness and battery life, while on AC power there is no meaningful energy reason to prefer short keep-alive over long.

Summary: The Settings That Matter

Three environment variables and one per-request parameter cover everything you need. Set OLLAMA_KEEP_ALIVE to match your usage pattern — 30m to 1h for interactive use, -1 for dedicated servers. Set OLLAMA_MAX_LOADED_MODELS to the number of models you want simultaneously resident in VRAM, based on your memory budget. Set OLLAMA_NUM_PARALLEL only if you need concurrent request handling. And use the per-request keep_alive field to override the global default for specific use cases — sending keep_alive: 0 after a batch job frees VRAM immediately, while sending keep_alive: -1 at startup pre-warms the model before users arrive. Together these four controls give you complete, granular management of Ollama’s memory behaviour without requiring any infrastructure changes beyond environment variables.
