Deploying Ollama on Kubernetes makes sense when you need AI inference available across a cluster, want GPU-accelerated pods managed by Kubernetes, or need to integrate Ollama with an existing Kubernetes-based infrastructure. This guide covers a production-ready Ollama Kubernetes deployment with GPU support, persistent model storage, and service exposure.
Prerequisites
- A Kubernetes cluster with GPU nodes (NVIDIA GPU Operator installed, or CUDA-capable nodes pre-configured)
- kubectl and helm installed locally
- A persistent volume provisioner (for storing Ollama model files)
Basic Deployment
# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          env:
            - name: OLLAMA_HOST
              value: '0.0.0.0:11434'
            - name: OLLAMA_KEEP_ALIVE
              value: '1h'
          resources:
            limits:
              nvidia.com/gpu: '1' # Request 1 GPU
              memory: '16Gi'
            requests:
              memory: '8Gi'
          volumeMounts:
            - name: ollama-data
              mountPath: /root/.ollama
      volumes:
        - name: ollama-data
          persistentVolumeClaim:
            claimName: ollama-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: ai
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi # Enough for several 7B models
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
  type: ClusterIP # Internal access only
Deploying and Pulling Models
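# Create the namespace if it does not exist yet, then apply the deployment
kubectl create namespace ai --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f ollama-deployment.yaml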
# Wait for the pod to be ready
kubectl wait --for=condition=ready pod -l app=ollama -n ai --timeout=120s
# Pull a model via kubectl exec
kubectl exec -n ai deploy/ollama -- ollama pull llama3.2
kubectl exec -n ai deploy/ollama -- ollama pull nomic-embed-text
# Test it works
kubectl exec -n ai deploy/ollama -- ollama run llama3.2 "Hello" --format json
Init Container for Pre-Pulling Models
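# Add under spec.template.spec in the Deployment
initContainers:
  - name: model-puller
    image: ollama/ollama:latest
    command:
      - /bin/sh
      - -c
      - |
        # Start a temporary server, pull the models, then stop the server
        ollama serve &
        SERVER_PID=$!
        sleep 5
        ollama pull llama3.2
        ollama pull nomic-embed-text
        kill "$SERVER_PID"
    env:
      - name: OLLAMA_HOST
        value: '0.0.0.0:11434'
    volumeMounts:
      - name: ollama-data
        mountPath: /root/.ollama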
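The fixed sleep 5 in the init script assumes the temporary server is ready within five seconds, which may not hold on a slow node. A minimal sketch of a more tolerant approach, assuming a POSIX shell is available in the image, is to poll until a command succeeds. The helper name wait_until is hypothetical and not part of Ollama.

```shell
# wait_until ATTEMPTS CMD...: retry CMD once per second until it succeeds.
# Hypothetical helper; returns 0 on success, 1 if every attempt failed.
wait_until() {
  attempts=$1
  shift
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

Inside the init script, this could replace the sleep: start ollama serve in the background, then run wait_until 60 ollama list before the first ollama pull. The ollama list command fails while the server is unreachable, which makes it a reasonable readiness check.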
Exposing via Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama
  namespace: ai
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: '300'
    nginx.ingress.kubernetes.io/proxy-send-timeout: '300'
spec:
  rules:
    - host: ollama.internal.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ollama
                port:
                  number: 11434
CPU-Only Deployment (No GPU)
# Remove the nvidia.com/gpu limit and use small models
resources:
  limits:
    memory: '8Gi'
  requests:
    memory: '4Gi'
# Use small models: llama3.2:1b, qwen2.5:1.5b
Helm Chart
A community Helm chart for Ollama is available at otwld/ollama-helm:
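helm repo add ollama-helm https://otwld.github.io/ollama-helm/
helm repo update
# Value names differ between chart versions; confirm with: helm show values ollama-helm/ollama
helm install ollama ollama-helm/ollama \
  --namespace ai \
  --set ollama.gpu.enabled=true \
  --set ollama.gpu.number=1 \
  --set 'ollama.models[0]=llama3.2' \
  --set 'ollama.models[1]=nomic-embed-text' \
  --set persistentVolume.enabled=true \
  --set persistentVolume.size=100Gi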
Why Kubernetes for Ollama
Most Ollama deployments start simple — a single machine, a systemd service, done. Kubernetes becomes relevant when your infrastructure is already Kubernetes-based and adding a non-Kubernetes service creates an operational inconsistency, or when you need capabilities that Kubernetes provides: automatic restart on failure via liveness probes, GPU resource management and scheduling, horizontal scaling of model replicas, rolling updates for Ollama version changes, and integration with your existing cluster monitoring (Prometheus, Grafana). For teams already operating Kubernetes, keeping Ollama inside the cluster simplifies networking (pod-to-pod communication without firewall rules), secrets management, and deployment automation through the same GitOps workflows used for other services.
The main challenge with Ollama on Kubernetes is GPU resource management. GPU nodes are typically more expensive and less plentiful than CPU nodes, and Kubernetes' GPU resource model (nvidia.com/gpu) schedules GPUs exclusively by default: a pod requesting a GPU claims it fully, even if inference uses only a fraction of GPU capacity, unless the cluster has been configured for GPU sharing (time-slicing or MIG). For clusters where GPU resources are scarce, consider whether the systemd approach on a dedicated GPU machine is simpler and more efficient than Kubernetes GPU pod scheduling. Kubernetes adds value in proportion to the complexity it manages. For a single Ollama instance the value is modest; for a multi-model, multi-team deployment with autoscaling it becomes compelling.
Liveness and Readiness Probes
livenessProbe:
  httpGet:
    path: /
    port: 11434
  initialDelaySeconds: 30
  periodSeconds: 30
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /api/tags
    port: 11434
  initialDelaySeconds: 60 # Allow time for model loading
  periodSeconds: 10
  failureThreshold: 6
ConfigMap for Environment Variables
apiVersion: v1
kind: ConfigMap
metadata:
  name: ollama-config
  namespace: ai
data:
  OLLAMA_HOST: '0.0.0.0:11434'
  OLLAMA_KEEP_ALIVE: '1h'
  OLLAMA_MAX_LOADED_MODELS: '2'
  OLLAMA_NUM_PARALLEL: '2'
---
# Reference in Deployment (under the ollama container):
envFrom:
  - configMapRef:
      name: ollama-config
Multiple Replicas with LoadBalancer
Scaling Ollama to multiple replicas requires shared model storage — each pod needs access to the same pulled models. Use a ReadWriteMany PVC (NFS, EFS on AWS, Filestore on GCP) rather than ReadWriteOnce:
# Deployment fragment
spec:
  replicas: 3
  template:
    spec:
      # PVC must use a ReadWriteMany storage class
      volumes:
        - name: ollama-data
          persistentVolumeClaim:
            claimName: ollama-pvc-rwx # ReadWriteMany
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai
spec:
  type: LoadBalancer # External access with load balancing across replicas
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
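The ollama-pvc-rwx claim referenced above is not defined elsewhere in this article. As a rough sketch, the shell function below prints one possible manifest for it. The function name rwx_pvc_manifest and the default storage class nfs-client are assumptions for illustration; use whichever ReadWriteMany-capable class your cluster's provisioner offers.

```shell
# Print a ReadWriteMany PVC manifest for shared model storage.
# The storage class default is hypothetical; pass your own as $1.
rwx_pvc_manifest() {
  storage_class=${1:-nfs-client}
  size=${2:-100Gi}
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc-rwx
  namespace: ai
spec:
  accessModes: [ReadWriteMany]
  storageClassName: ${storage_class}
  resources:
    requests:
      storage: ${size}
EOF
}
```

To create the claim, pipe the output into kubectl, for example rwx_pvc_manifest efs-sc 200Gi | kubectl apply -f - (efs-sc being a placeholder class name).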
Monitoring with Prometheus ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ollama
  namespace: ai
spec:
  selector:
    matchLabels:
      app: ollama # The Service itself must carry this label
  endpoints:
    - port: http # Must match a named port on the Service
      path: /metrics # Requires the Prometheus exporter sidecar
      interval: 15s
Getting Started
Apply the basic deployment YAML from this article, wait for the pod to be ready, exec into it to pull your required models, and verify with a curl call to the service endpoint. The init container pattern is the most reliable way to ensure models are pre-pulled before the pod starts accepting traffic — it eliminates the cold start delay for the first request after deployment. For production deployments, add liveness and readiness probes to enable Kubernetes to detect and recover from Ollama failures automatically, and use a ConfigMap for environment configuration so model settings can be changed without rebuilding the container image.
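The curl verification mentioned above could look something like the following sketch. It assumes the service is reachable locally, for example through kubectl port-forward -n ai svc/ollama 11434:11434, and that the model has already been pulled. The function names and the OLLAMA_URL variable are hypothetical conveniences, not part of Ollama.

```shell
# Build a non-streaming /api/generate request body.
# Assumes the model name and prompt contain no double quotes or backslashes.
generate_body() {
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$2"
}

# Send the request; OLLAMA_URL (hypothetical) defaults to a local port-forward.
ollama_generate() {
  base_url=${OLLAMA_URL:-http://localhost:11434}
  curl -sS "$base_url/api/generate" -d "$(generate_body "$1" "$2")"
}
```

Calling ollama_generate llama3.2 "Hello" should return a single JSON object containing the model's reply once the model has loaded.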
Deploying Ollama on Kubernetes enables running local LLM inference at scale — multiple replicas, GPU node scheduling, rolling updates, and production-grade health checking. This guide covers the key Kubernetes resources needed to deploy Ollama as a reliable service in a cluster.
Why Kubernetes for Ollama
Kubernetes adds value over a simple systemd deployment when you need: multiple Ollama instances across different nodes (for load distribution or model specialisation), automatic restarts and health monitoring, rolling updates with zero downtime, resource quotas and GPU scheduling, and integration with the broader Kubernetes ecosystem (ingress controllers, service meshes, observability tools). For a single-machine personal deployment, systemd is simpler and sufficient. For team or production deployments where you want infrastructure-as-code, version control for your deployment configuration, and automatic recovery from node failures, Kubernetes is the right environment.
Namespace and ConfigMap
apiVersion: v1
kind: Namespace
metadata:
  name: ai
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ollama-config
  namespace: ai
data:
  OLLAMA_HOST: '0.0.0.0:11434'
  OLLAMA_KEEP_ALIVE: '30m'
  OLLAMA_MAX_LOADED_MODELS: '2'
Deployment with GPU Support
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      nodeSelector:
        accelerator: nvidia-gpu # Target GPU nodes
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          envFrom:
            - configMapRef:
                name: ollama-config
          resources:
            limits:
              nvidia.com/gpu: 1 # Request 1 GPU
              memory: 16Gi
            requests:
              memory: 8Gi
          volumeMounts:
            - name: ollama-data
              mountPath: /root/.ollama
          livenessProbe:
            httpGet:
              path: /
              port: 11434
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /api/tags
              port: 11434
            initialDelaySeconds: 10
            periodSeconds: 5
      volumes:
        - name: ollama-data
          persistentVolumeClaim:
            claimName: ollama-pvc
PersistentVolumeClaim and Service
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: ai
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi # Adjust for your model sizes
  storageClassName: fast-ssd # Use fast storage for models
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
Init Container for Model Pulling
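# Add under spec.template.spec in the Deployment
initContainers:
  - name: pull-models
    image: ollama/ollama:latest
    command: ["/bin/sh", "-c"]
    args:
      - |
        # Start a temporary server, pull the models, then stop the server
        ollama serve &
        SERVER_PID=$!
        sleep 5
        ollama pull llama3.2
        ollama pull nomic-embed-text
        kill "$SERVER_PID"
    volumeMounts:
      - name: ollama-data
        mountPath: /root/.ollama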
Horizontal Pod Autoscaler
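# Each extra replica needs its own GPU, and replicas on other nodes
# need a ReadWriteMany volume to mount the same model storage.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
  namespace: ai
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80 # Percentage of the memory request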
Considerations for Production
Model weights are large, from roughly 2GB for a small model such as llama3.2 to about 40GB for a quantised 70B model, so the first pull on a new deployment takes significant time. Use the init container pattern or pre-bake models into a custom image (FROM ollama/ollama:latest + RUN ollama serve & sleep 5 && ollama pull llama3.2) to avoid cold-pull delays on pod start. GPU scheduling requires the NVIDIA device plugin installed in the cluster; verify with kubectl describe nodes | grep nvidia.com/gpu. Running multiple Ollama deployments that serve different models on different GPU nodes is a common production pattern for teams with diverse model requirements. When each deployment has its own PersistentVolumeClaim, each instance manages its pulled models independently, so coordinate model availability through your init containers or a shared NFS mount for the model cache.
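For the device plugin check, a per-node view is often easier to read than grepping the describe output. The sketch below wraps a kubectl query in a function; list_gpu_nodes is a hypothetical name, and the column expression assumes the plugin advertises the standard nvidia.com/gpu resource.

```shell
# List each node with its allocatable NVIDIA GPU count.
# Nodes that do not advertise the resource show <none> in the GPU column.
list_gpu_nodes() {
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

The backslash escapes the dot inside the resource name so that kubectl does not treat it as a path separator.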
Multi-Model Architecture on Kubernetes
A production Kubernetes Ollama deployment often runs multiple specialised instances rather than one general-purpose one. A common pattern: a fast small-model deployment (qwen2.5:1.5b, 2 CPU replicas) for high-volume classification and routing tasks, a medium-model deployment (llama3.1 8B, 1 GPU replica) for generation tasks, and a large-model deployment (llama3.1 70B quantised, 1 high-memory GPU replica) for complex reasoning tasks. Each deployment gets its own Kubernetes Service, and application code routes requests to the appropriate service based on task type. This architecture scales each tier independently and avoids GPU contention between model sizes.
Kubernetes namespaces provide logical isolation between environments — run development, staging, and production Ollama deployments in the same cluster without interference, with ResourceQuota objects limiting GPU and memory usage per namespace. Network policies restrict which application namespaces can reach the Ollama services, providing security isolation alongside logical isolation.
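A ResourceQuota of the kind described above might be generated as in the sketch below. The function name, the quota name ollama-quota, and the example limits are placeholders rather than recommendations. Extended resources such as GPUs are quota-limited through the requests. prefix.

```shell
# Print a ResourceQuota capping GPU and memory requests in one namespace.
# Arguments: namespace, maximum GPUs, maximum total memory requests.
gpu_quota_manifest() {
  namespace=$1
  max_gpus=$2
  max_memory=$3
  cat <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ollama-quota
  namespace: ${namespace}
spec:
  hard:
    requests.nvidia.com/gpu: "${max_gpus}"
    requests.memory: ${max_memory}
EOF
}
```

For example, gpu_quota_manifest ai 2 64Gi | kubectl apply -f - would cap the ai namespace at two GPUs. Once a quota covers requests.memory, every pod in that namespace must declare a memory request or it will be rejected.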
Deploying with Helm
A Helm chart bundles all the Kubernetes resources into a single deployable unit with configurable values:
# Install a community Ollama Helm chart
# Value names differ between chart versions; confirm with: helm show values ollama/ollama
helm repo add ollama https://otwld.github.io/ollama-helm/
helm install ollama ollama/ollama \
  --namespace ai \
  --create-namespace \
  --set ollama.gpu.enabled=true \
  --set 'ollama.models={llama3.2,nomic-embed-text}' \
  --set persistentVolume.enabled=true \
  --set persistentVolume.size=100Gi
Community Helm charts handle GPU node selection, PVC creation, init container model pulling, liveness/readiness probes, and service exposure with sensible defaults. Using a Helm chart rather than raw YAML reduces the YAML surface area you maintain and makes upgrades as simple as helm upgrade ollama ollama/ollama --set image.tag=latest.
Monitoring in Kubernetes
Deploy the Prometheus exporter from this article series as a sidecar container in the Ollama pod or as a separate Deployment in the same namespace. Configure ServiceMonitor resources (if using the Prometheus Operator) to auto-discover the exporter. The standard kube-state-metrics and node-exporter tools capture pod restart counts and node memory pressure, while GPU utilisation needs a GPU-aware exporter such as NVIDIA's DCGM exporter. Combining these signals with your Ollama-specific metrics in a single Grafana dashboard gives visibility into cluster-level health and Ollama-specific performance in one place.
Getting Started
Apply the Namespace, ConfigMap, PVC, Deployment, and Service manifests from this article in order using kubectl apply -f. Verify the pod is running with kubectl get pods -n ai and the readiness probe is passing. Port-forward to test locally: kubectl port-forward -n ai svc/ollama 11434:11434, then run curl http://localhost:11434/api/tags to confirm models are available. From a working single-node deployment, add the init container for automatic model pulling, configure the HPA for autoscaling, and connect your application services via the Kubernetes Service DNS name (ollama.ai.svc.cluster.local:11434).
Kubernetes vs Simpler Deployment Options
Kubernetes is not the right deployment target for every Ollama use case. For a personal or small-team deployment on a single machine, systemd is simpler to operate and has less overhead. For a team that is already running Docker Compose for their application stack, adding Ollama as a Compose service is the lowest-friction option. Kubernetes adds genuine value when you have: multiple nodes with GPUs that you want to schedule workloads across, existing Kubernetes infrastructure that your team already knows, a need for the full Kubernetes operational feature set (rolling updates, RBAC, network policies, ingress), or a requirement to run Ollama alongside other Kubernetes workloads with shared resource management. If none of these apply, adopt the simpler deployment first and migrate to Kubernetes when the operational requirements justify it.
The manifests in this article follow Kubernetes best practices: resource limits prevent a single pod from consuming all cluster resources, liveness and readiness probes give Kubernetes accurate signal for routing traffic and restarting unhealthy pods, and a PersistentVolumeClaim ensures model data survives pod restarts and rescheduling. These are the minimum viable production configuration — add RBAC for fine-grained access control, PodDisruptionBudgets for high availability, and NetworkPolicies for security isolation as your deployment matures.
CPU-Only Kubernetes Deployment
Not all Kubernetes clusters have GPU nodes. For CPU-only deployments, remove the nvidia.com/gpu resource limit and the nodeSelector for GPU nodes. CPU inference is significantly slower (5–20x) but viable for lower-volume workloads, smaller models (1.5–7B quantised), and environments where GPU nodes are unavailable or too expensive. A CPU-only Kubernetes deployment with 2–4 replicas can serve a small team using a 7B quantised model at acceptable latency for non-interactive batch tasks.
# CPU-only container spec (remove GPU resources)
containers:
  - name: ollama
    image: ollama/ollama:latest
    resources:
      limits:
        memory: 16Gi
        cpu: '8'
      requests:
        memory: 8Gi
        cpu: '4'
Exposing Ollama via Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama-ingress
  namespace: ai
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: '300'
    nginx.ingress.kubernetes.io/proxy-send-timeout: '300'
spec:
  rules:
    - host: ollama.internal.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ollama
                port:
                  number: 11434
The proxy timeout annotations are critical — default NGINX timeouts of 60 seconds will abort long-running inference requests. Set both read and send timeouts to at least 300 seconds for 7B models on CPU, and higher for larger models. Add basic auth or OAuth2 proxy annotations for authentication if Ollama is exposed outside the cluster’s internal network.
Security Hardening
Ollama’s API has no built-in authentication. In Kubernetes, enforce authentication before requests reach Ollama using an OAuth2 proxy sidecar or an ingress controller with auth annotations. Network policies restrict which pods can reach the Ollama service, preventing unauthenticated access from pods that should not call it:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-access
  namespace: ai
spec:
  podSelector:
    matchLabels:
      app: ollama
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              access-ollama: 'true'
      ports:
        - port: 11434
Label application namespaces with access-ollama: 'true' to grant them access. This ensures only authorised application workloads can call Ollama, regardless of where they are deployed in the cluster — a meaningful security boundary in shared clusters where different teams or applications share infrastructure.
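Granting and revoking access then comes down to adding or removing the label. The helper names below are hypothetical, and the sketch assumes your CNI plugin enforces NetworkPolicy objects; without that, the policy has no effect.

```shell
# Label a namespace so the NetworkPolicy's namespaceSelector matches it.
grant_ollama_access() {
  kubectl label namespace "$1" access-ollama=true --overwrite
}

# Remove the label again (a trailing "-" deletes a label).
revoke_ollama_access() {
  kubectl label namespace "$1" access-ollama-
}
```

For example, grant_ollama_access my-app (with my-app standing in for a real namespace name) lets pods in that namespace reach the Ollama service on port 11434.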
Kubernetes as the Long-Term Foundation
Teams that deploy Ollama on Kubernetes today benefit from the same infrastructure advantages they get for all their other services: GitOps workflows where deployment manifests live in version control and changes are deployed via pull request, automatic pod scheduling across nodes for best resource utilisation, built-in health monitoring with automatic restarts, and integration with the broader Kubernetes tooling ecosystem. The initial investment in writing Kubernetes manifests pays off as the deployment is managed consistently alongside the rest of the team's infrastructure, using the operational practices and tooling already in place. For organisations where Kubernetes is already the standard deployment platform, running Ollama anywhere else would create an operational inconsistency. The manifests in this article make Ollama a first-class citizen of your cluster rather than a special-cased exception, which reduces cognitive overhead and gives you a foundation for observable, secure AI infrastructure as your usage and your team's expertise grow.