How to Deploy Ollama on Kubernetes

Deploying Ollama on Kubernetes makes sense when you need AI inference available across a cluster, want GPU-accelerated pods managed by Kubernetes, or need to integrate Ollama with an existing Kubernetes-based infrastructure. This guide covers a production-ready Ollama Kubernetes deployment with GPU support, persistent model storage, and service exposure.

Prerequisites

  • A Kubernetes cluster with GPU nodes (NVIDIA GPU Operator installed, or CUDA-capable nodes pre-configured)
  • kubectl and helm installed locally
  • A persistent volume provisioner (for storing Ollama model files)

Basic Deployment

# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        env:
        - name: OLLAMA_HOST
          value: '0.0.0.0:11434'
        - name: OLLAMA_KEEP_ALIVE
          value: '1h'
        resources:
          limits:
            nvidia.com/gpu: '1'   # Request 1 GPU
            memory: '16Gi'
          requests:
            memory: '8Gi'
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: ai
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi  # Enough for several 7B models
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
  type: ClusterIP  # Internal access only

Deploying and Pulling Models

# Create the namespace the manifests reference, then apply the deployment
kubectl create namespace ai
kubectl apply -f ollama-deployment.yaml

# Wait for the pod to be ready
kubectl wait --for=condition=ready pod -l app=ollama -n ai --timeout=120s

# Pull a model via kubectl exec
kubectl exec -n ai deploy/ollama -- ollama pull llama3.2
kubectl exec -n ai deploy/ollama -- ollama pull nomic-embed-text

# Test it works
kubectl exec -n ai deploy/ollama -- ollama run llama3.2 "Hello" --format json

Init Container for Pre-Pulling Models

initContainers:
- name: model-puller
  image: ollama/ollama:latest
  command:
  - /bin/sh
  - -c
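  - |
    # Start a temporary server in the background and remember its PID
    ollama serve &
    SERVER_PID=$!
    # Wait until the server answers instead of relying on a fixed sleep
    until ollama list >/dev/null 2>&1; do sleep 1; done
    ollama pull llama3.2
    ollama pull nomic-embed-text
    kill "$SERVER_PID"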
  env:
  - name: OLLAMA_HOST
    value: '0.0.0.0:11434'
  volumeMounts:
  - name: ollama-data
    mountPath: /root/.ollama

Exposing via Ingress

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama
  namespace: ai
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: '300'
    nginx.ingress.kubernetes.io/proxy-send-timeout: '300'
spec:
  rules:
  - host: ollama.internal.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ollama
            port:
              number: 11434

CPU-Only Deployment (No GPU)

# Remove the nvidia.com/gpu limit and use small models
resources:
  limits:
    memory: '8Gi'
  requests:
    memory: '4Gi'
# Use small models: llama3.2:1b, qwen2.5:1.5b

Helm Chart

A community Helm chart for Ollama is available at otwld/ollama-helm:

helm repo add ollama-helm https://otwld.github.io/ollama-helm/
helm install ollama ollama-helm/ollama \
  --namespace ai \
  --set ollama.gpu.enabled=true \
  --set ollama.gpu.number=1 \
  --set 'ollama.models[0]=llama3.2' \
  --set 'ollama.models[1]=nomic-embed-text' \
  --set persistentVolume.enabled=true \
  --set persistentVolume.size=100Gi

Why Kubernetes for Ollama

Most Ollama deployments start simple — a single machine, a systemd service, done. Kubernetes becomes relevant when your infrastructure is already Kubernetes-based and adding a non-Kubernetes service creates an operational inconsistency, or when you need capabilities that Kubernetes provides: automatic restart on failure via liveness probes, GPU resource management and scheduling, horizontal scaling of model replicas, rolling updates for Ollama version changes, and integration with your existing cluster monitoring (Prometheus, Grafana). For teams already operating Kubernetes, keeping Ollama inside the cluster simplifies networking (pod-to-pod communication without firewall rules), secrets management, and deployment automation through the same GitOps workflows used for other services.

The main challenge with Ollama on Kubernetes is GPU resource management. GPU nodes are typically more expensive and less plentiful than CPU nodes, and by default Kubernetes’ GPU resource model (nvidia.com/gpu) schedules GPUs exclusively: a pod requesting a GPU claims it fully, even if inference uses only a fraction of GPU capacity. Sharing a GPU between pods requires extra configuration, such as time-slicing or MIG in the NVIDIA device plugin. For clusters where GPU resources are scarce, consider whether the systemd approach on a dedicated GPU machine is simpler and more efficient than Kubernetes GPU pod scheduling. Kubernetes adds value proportional to the complexity it manages: for a single Ollama instance, the value is modest; for a multi-model, multi-team deployment with autoscaling, it becomes compelling.

Liveness and Readiness Probes

livenessProbe:
  httpGet:
    path: /
    port: 11434
  initialDelaySeconds: 30
  periodSeconds: 30
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /api/tags
    port: 11434
  initialDelaySeconds: 60  # Allow time for the server to start
  periodSeconds: 10
  failureThreshold: 6

ConfigMap for Environment Variables

apiVersion: v1
kind: ConfigMap
metadata:
  name: ollama-config
  namespace: ai
data:
  OLLAMA_HOST: '0.0.0.0:11434'
  OLLAMA_KEEP_ALIVE: '1h'
  OLLAMA_MAX_LOADED_MODELS: '2'
  OLLAMA_NUM_PARALLEL: '2'
---
# Reference in Deployment:
envFrom:
- configMapRef:
    name: ollama-config

Multiple Replicas with LoadBalancer

Scaling Ollama to multiple replicas requires shared model storage — each pod needs access to the same pulled models. Use a ReadWriteMany PVC (NFS, EFS on AWS, Filestore on GCP) rather than ReadWriteOnce:

# Deployment changes (fragment)
spec:
  replicas: 3
  template:
    spec:
      # PVC must use a ReadWriteMany storage class
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc-rwx  # ReadWriteMany
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai
spec:
  type: LoadBalancer  # External access with load balancing across replicas
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
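
The Deployment fragment above references a claim named ollama-pvc-rwx that is not defined elsewhere in this article. A minimal sketch of such a claim might look like the following. The storage class name nfs-client is a placeholder assumption, so substitute whichever ReadWriteMany-capable class your cluster provides (NFS, EFS, Filestore).

```yaml
# Sketch of the shared claim referenced as ollama-pvc-rwx above
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc-rwx
  namespace: ai
spec:
  accessModes:
  - ReadWriteMany               # Lets pods on different nodes mount the same volume
  storageClassName: nfs-client  # Assumption: replace with your RWX-capable class
  resources:
    requests:
      storage: 100Gi
```

With a shared volume, a model pulled by one replica is visible to the others, so models only need to be pulled once.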

Monitoring with Prometheus ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ollama
  namespace: ai
spec:
  selector:
    matchLabels:
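      app: ollama  # Matches labels on the Service metadata, not on the pods
  endpoints:
  - port: http  # Must be the name of a port defined on the Service
    path: /metrics  # Requires the Prometheus exporter sidecar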
    interval: 15s

Getting Started

Apply the basic deployment YAML from this article, wait for the pod to be ready, exec into it to pull your required models, and verify with a curl call to the service endpoint (for example /api/tags, which lists the pulled models). The init container pattern is the most reliable way to ensure models are pre-pulled before the pod starts accepting traffic, which avoids a download delay on the first request after deployment. For production deployments, add liveness and readiness probes so Kubernetes can detect and recover from Ollama failures automatically, and use a ConfigMap for environment configuration so settings can be changed without editing the Deployment manifest (restart the pods to pick up new values).

Deploying Ollama on Kubernetes enables running local LLM inference at scale — multiple replicas, GPU node scheduling, rolling updates, and production-grade health checking. This guide covers the key Kubernetes resources needed to deploy Ollama as a reliable service in a cluster.

Why Kubernetes for Ollama

Kubernetes adds value over a simple systemd deployment when you need: multiple Ollama instances across different nodes (for load distribution or model specialisation), automatic restarts and health monitoring, rolling updates with zero downtime, resource quotas and GPU scheduling, and integration with the broader Kubernetes ecosystem (ingress controllers, service meshes, observability tools). For a single-machine personal deployment, systemd is simpler and sufficient. For team or production deployments where you want infrastructure-as-code, version control for your deployment configuration, and automatic recovery from node failures, Kubernetes is the right environment.

Namespace and ConfigMap

apiVersion: v1
kind: Namespace
metadata:
  name: ai
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ollama-config
  namespace: ai
data:
  OLLAMA_HOST: '0.0.0.0:11434'
  OLLAMA_KEEP_ALIVE: '30m'
  OLLAMA_MAX_LOADED_MODELS: '2'

Deployment with GPU Support

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      nodeSelector:
        accelerator: nvidia-gpu  # Target GPU nodes
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        envFrom:
        - configMapRef:
            name: ollama-config
        resources:
          limits:
            nvidia.com/gpu: 1  # Request 1 GPU
            memory: 16Gi
          requests:
            memory: 8Gi
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
        livenessProbe:
          httpGet:
            path: /
            port: 11434
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 10
          periodSeconds: 5
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc

PersistentVolumeClaim and Service

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: ai
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi  # Adjust for your model sizes
  storageClassName: fast-ssd  # Use fast storage for models
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434

Init Container for Model Pulling

initContainers:
- name: pull-models
  image: ollama/ollama:latest
  command: ["/bin/sh", "-c"]
  args:
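  - |
    # Start a temporary server in the background and remember its PID
    ollama serve &
    SERVER_PID=$!
    # Wait until the server answers instead of relying on a fixed sleep
    until ollama list >/dev/null 2>&1; do sleep 1; done
    ollama pull llama3.2
    ollama pull nomic-embed-text
    # Stop by PID; pkill may not be installed in the image
    kill "$SERVER_PID"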
  volumeMounts:
  - name: ollama-data
    mountPath: /root/.ollama

Horizontal Pod Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
  namespace: ai
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama
  minReplicas: 1
  maxReplicas: 4  # Each extra replica needs a free GPU, and ReadWriteMany storage if it lands on another node
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

Considerations for Production

Model weights range from a few gigabytes for small quantised models to tens of gigabytes for 70B-class models, so the first pull can take a long time. Use the init container pattern or pre-bake models into a custom image (FROM ollama/ollama:latest + RUN ollama serve & sleep 5 && ollama pull llama3.2) to avoid cold-pull delays on pod start. GPU scheduling requires the NVIDIA device plugin installed in the cluster; verify that nodes advertise GPUs with kubectl describe nodes | grep nvidia.com/gpu. Running multiple Ollama deployments that serve different models on different GPU nodes is a common production pattern for teams with diverse model requirements. When each deployment has its own PersistentVolumeClaim, each one manages its pulled models independently; coordinate model availability through your init containers, or mount a shared NFS volume as the model cache.

Multi-Model Architecture on Kubernetes

A production Kubernetes Ollama deployment often runs multiple specialised instances rather than one general-purpose one. A common pattern: a fast small-model deployment (qwen2.5:1.5b, 2 CPU replicas) for high-volume classification and routing tasks, a medium-model deployment (for example llama3.1 8B, 1 GPU replica) for generation tasks, and a large-model deployment (llama3.1 70B quantised, 1 high-memory GPU replica) for complex reasoning tasks. Each deployment gets its own Kubernetes Service, and application code routes requests to the appropriate service based on task type. This architecture scales each tier independently and avoids GPU contention between model sizes.
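
As a rough sketch of the one-Service-per-tier idea, the manifests below define Services for two of the tiers. The Service names and the tier label are hypothetical and would need to match labels you add to each Deployment's pod template; a medium tier would follow the same shape.

```yaml
# Hypothetical names: one Service per model tier, selected by a 'tier' label
apiVersion: v1
kind: Service
metadata:
  name: ollama-small
  namespace: ai
spec:
  selector:
    app: ollama
    tier: small   # Assumed label on the small-model Deployment's pods
  ports:
  - port: 11434
    targetPort: 11434
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-large
  namespace: ai
spec:
  selector:
    app: ollama
    tier: large   # Assumed label on the large-model Deployment's pods
  ports:
  - port: 11434
    targetPort: 11434
```

Application code would then send classification traffic to http://ollama-small.ai.svc.cluster.local:11434 and reasoning traffic to http://ollama-large.ai.svc.cluster.local:11434.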

Kubernetes namespaces provide logical isolation between environments — run development, staging, and production Ollama deployments in the same cluster without interference, with ResourceQuota objects limiting GPU and memory usage per namespace. Network policies restrict which application namespaces can reach the Ollama services, providing security isolation alongside logical isolation.
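
A ResourceQuota of the kind described above might look like the following sketch. The quota name and all of the numbers are illustrative assumptions; size them to your own cluster.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ollama-quota   # Hypothetical name
  namespace: ai
spec:
  hard:
    requests.nvidia.com/gpu: "2"  # At most 2 GPUs requested across the namespace
    requests.memory: 32Gi         # Illustrative values
    limits.memory: 64Gi
```

Once a memory quota is active, every pod in the namespace must declare memory requests and limits, otherwise the API server rejects it.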

Deploying with Helm

A Helm chart bundles all the Kubernetes resources into a single deployable unit with configurable values:

# Install a community Ollama Helm chart
helm repo add ollama https://otwld.github.io/ollama-helm/
helm install ollama ollama/ollama \
  --namespace ai \
  --create-namespace \
  --set ollama.gpu.enabled=true \
  --set 'ollama.models={llama3.2,nomic-embed-text}' \
  --set persistentVolume.enabled=true \
  --set persistentVolume.size=100Gi

Community Helm charts typically handle GPU resource requests, PVC creation, model pulling at startup, liveness/readiness probes, and service exposure with sensible defaults. Using a Helm chart rather than raw YAML reduces the amount of YAML you maintain and turns an upgrade into a single command, for example helm upgrade ollama ollama/ollama --reuse-values --set image.tag=<new-version>.
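
Long chains of --set flags are easy to mistype, so the same settings can be kept in a values file under version control. The sketch below mirrors the --set flags used in the first Helm section of this article. Key names depend on the chart version (newer releases may nest the model list differently), so treat them as assumptions and check the chart's own values.yaml.

```yaml
# values.yaml (sketch; verify key names against the chart version you install)
ollama:
  gpu:
    enabled: true
    number: 1
  models:
  - llama3.2
  - nomic-embed-text
persistentVolume:
  enabled: true   # Assumption: persistence is off unless enabled
  size: 100Gi
```

Install with helm install ollama ollama/ollama --namespace ai -f values.yaml.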

Monitoring in Kubernetes

Deploy the Prometheus exporter from this article series as a sidecar container in the Ollama pod or as a separate Deployment in the same namespace. Configure ServiceMonitor resources (if using the Prometheus Operator) to auto-discover the exporter. The standard Kubernetes monitoring tools kube-state-metrics and node-exporter capture pod restart counts and node memory pressure; GPU utilisation comes from a GPU-specific exporter such as NVIDIA’s DCGM exporter, which the GPU Operator can deploy. Combining these signals with your Ollama-specific metrics in a single Grafana dashboard gives visibility into cluster-level health and Ollama performance in one place.

Getting Started

Apply the Namespace, ConfigMap, PVC, Deployment, and Service manifests from this article in order using kubectl apply -f. Verify the pod is running with kubectl get pods -n ai and the readiness probe is passing. Port-forward to test locally: kubectl port-forward -n ai svc/ollama 11434:11434, then run curl http://localhost:11434/api/tags to confirm models are available. From a working single-node deployment, add the init container for automatic model pulling, configure the HPA for autoscaling, and connect your application services via the Kubernetes Service DNS name (ollama.ai.svc.cluster.local:11434).
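
To connect an application through the Service DNS name, one option is to pass the URL as an environment variable. The fragment below is a sketch for a hypothetical application container; the variable name OLLAMA_BASE_URL is an assumption, so use whatever setting your application actually reads.

```yaml
# Fragment of an application Deployment's container spec (hypothetical app)
env:
- name: OLLAMA_BASE_URL   # Hypothetical variable name
  value: http://ollama.ai.svc.cluster.local:11434
```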

Kubernetes vs Simpler Deployment Options

Kubernetes is not the right deployment target for every Ollama use case. For a personal or small-team deployment on a single machine, systemd is simpler to operate and has less overhead. For a team that is already running Docker Compose for their application stack, adding Ollama as a Compose service is the lowest-friction option. Kubernetes adds genuine value when you have: multiple nodes with GPUs that you want to schedule workloads across, existing Kubernetes infrastructure that your team already knows, a need for the full Kubernetes operational feature set (rolling updates, RBAC, network policies, ingress), or a requirement to run Ollama alongside other Kubernetes workloads with shared resource management. If none of these apply, adopt the simpler deployment first and migrate to Kubernetes when the operational requirements justify it.

The manifests in this article follow Kubernetes best practices: resource limits prevent a single pod from consuming all cluster resources, liveness and readiness probes give Kubernetes accurate signal for routing traffic and restarting unhealthy pods, and a PersistentVolumeClaim ensures model data survives pod restarts and rescheduling. These are the minimum viable production configuration — add RBAC for fine-grained access control, PodDisruptionBudgets for high availability, and NetworkPolicies for security isolation as your deployment matures.
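
As one example of those follow-up additions, a PodDisruptionBudget for the Ollama pods could be sketched as follows. The name ollama-pdb is hypothetical.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ollama-pdb   # Hypothetical name
  namespace: ai
spec:
  minAvailable: 1    # Keep at least one Ollama pod running during voluntary disruptions
  selector:
    matchLabels:
      app: ollama
```

This is mainly useful with two or more replicas. With a single replica, minAvailable: 1 blocks voluntary evictions such as node drains until the budget is relaxed.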

CPU-Only Kubernetes Deployment

Not all Kubernetes clusters have GPU nodes. For CPU-only deployments, remove the nvidia.com/gpu resource limit and the nodeSelector for GPU nodes. CPU inference is significantly slower (5–20x) but viable for lower-volume workloads, smaller models (1.5–7B quantised), and environments where GPU nodes are unavailable or too expensive. A CPU-only Kubernetes deployment with 2–4 replicas can serve a small team using a 7B quantised model at acceptable latency for non-interactive batch tasks.

# CPU-only container spec (remove GPU resources)
containers:
- name: ollama
  image: ollama/ollama:latest
  resources:
    limits:
      memory: 16Gi
      cpu: '8'
    requests:
      memory: 8Gi
      cpu: '4'

Exposing Ollama via Ingress

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama-ingress
  namespace: ai
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: '300'
    nginx.ingress.kubernetes.io/proxy-send-timeout: '300'
spec:
  rules:
  - host: ollama.internal.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ollama
            port:
              number: 11434

The proxy timeout annotations are critical — default NGINX timeouts of 60 seconds will abort long-running inference requests. Set both read and send timeouts to at least 300 seconds for 7B models on CPU, and higher for larger models. Add basic auth or OAuth2 proxy annotations for authentication if Ollama is exposed outside the cluster’s internal network.
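
For basic auth with the NGINX ingress controller, the annotations could be sketched as below. The Secret name ollama-basic-auth is hypothetical; it is assumed to live in the same namespace as the Ingress and to hold htpasswd-formatted credentials under the key auth.

```yaml
# Fragment: add to the Ingress metadata alongside the timeout annotations
metadata:
  annotations:
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: ollama-basic-auth  # Hypothetical Secret name
    nginx.ingress.kubernetes.io/auth-realm: 'Authentication Required'
```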

Security Hardening

Ollama’s API has no built-in authentication. In Kubernetes, enforce authentication before requests reach Ollama using an OAuth2 proxy sidecar or an ingress controller with auth annotations. Network policies restrict which pods can reach the Ollama service, preventing unauthenticated access from pods that should not call it:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-access
  namespace: ai
spec:
  podSelector:
    matchLabels:
      app: ollama
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          access-ollama: 'true'
    ports:
    - port: 11434

Label application namespaces with access-ollama: 'true' to grant them access. Only pods in labelled namespaces can then reach Ollama on port 11434, which is a meaningful security boundary in shared clusters where different teams or applications share infrastructure. Note that NetworkPolicies are enforced only when the cluster’s network plugin supports them.
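
Declaring the label in the namespace manifest keeps the access grant in version control. In this sketch the namespace name my-app is a hypothetical placeholder for your application namespace.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-app   # Hypothetical application namespace
  labels:
    access-ollama: 'true'   # Matched by the namespaceSelector in the NetworkPolicy
```

For an existing namespace, kubectl label namespace my-app access-ollama=true has the same effect.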

Kubernetes as the Long-Term Foundation

Teams that deploy Ollama on Kubernetes benefit from the same infrastructure advantages they get for their other services: GitOps workflows where deployment manifests live in version control and changes are deployed via pull request, automatic pod scheduling across nodes, built-in health monitoring with automatic restarts, and integration with the broader Kubernetes tooling ecosystem. The initial investment in writing manifests pays off because Ollama is then managed alongside the rest of the team’s infrastructure, using the operational practices and tooling already in place. For organisations where Kubernetes is already the standard deployment platform, running Ollama anywhere else would create an operational inconsistency. The manifests in this article make Ollama a first-class citizen of the cluster rather than a special-cased exception, which reduces cognitive overhead and gives you a foundation for observable, secure AI infrastructure that can grow with your application and your team’s expertise.
