KServe (formerly KFServing) has become the de facto Kubernetes-native model serving platform for production ML workloads. It provides a unified interface for deploying models from any framework — PyTorch, TensorFlow, scikit-learn, XGBoost, Hugging Face Transformers — with built-in autoscaling, canary deployments, model versioning, and request batching. Compared to writing your own FastAPI service and Kubernetes Deployment manifests, KServe handles the operational complexity: traffic routing between model versions, scale-to-zero with Knative, GPU scheduling, and a standardized inference protocol (the v2 inference protocol) that behaves the same across frameworks.
This guide covers the complete path from a trained model to a production KServe endpoint: cluster setup, deploying a PyTorch model, setting resource requests and limits, configuring autoscaling, canary rollouts, and monitoring inference endpoints.
Prerequisites and Cluster Setup
KServe requires a Kubernetes cluster with at least Kubernetes 1.25, cert-manager for TLS certificate management, and either Knative Serving (for serverless scale-to-zero deployments) or a raw Kubernetes Ingress controller (for always-on deployments). On managed clusters, EKS, GKE, and AKS all work — GKE with Autopilot has the simplest setup since GPU node pools and cluster autoscaling are handled automatically.
# Install cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml
# Install Knative Serving (a Knative networking layer such as Kourier or Istio must also be installed)
kubectl apply -f https://github.com/knative/serving/releases/latest/download/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/latest/download/serving-core.yaml
# Install KServe
kubectl apply -f https://github.com/kserve/kserve/releases/latest/download/kserve.yaml
kubectl apply -f https://github.com/kserve/kserve/releases/latest/download/kserve-runtimes.yaml
# Verify installation
kubectl get pods -n kserve
For local development and testing, kind (Kubernetes in Docker) with the KServe quick install script gets you a working cluster in minutes without cloud costs. Production clusters should use separate namespaces per team or environment, with RBAC configured to restrict who can create InferenceService resources.
Packaging Your Model
KServe expects models to be stored in a supported format in an accessible storage location — S3, GCS, Azure Blob, or a PVC. For PyTorch models, you can use TorchServe (the default PyTorch serving runtime) or the Hugging Face runtime for transformer models. The simplest path for a custom PyTorch model is to save it in TorchScript format and push to S3:
import boto3
import torch

# MyModel is your own nn.Module class; load the trained weights and switch to eval mode
model = MyModel()
model.load_state_dict(torch.load("checkpoint.pt", map_location="cpu"))
model.eval()

# Compile to TorchScript (use torch.jit.trace with a sample input if scripting fails)
scripted = torch.jit.script(model)
scripted.save("model.pt")

# Upload the serialized model to S3 where KServe can pull it
s3 = boto3.client("s3")
s3.upload_file("model.pt", "my-models-bucket", "my-model/v1/model.pt")
For Hugging Face transformer models, save the model in the standard Hugging Face format (config.json, tokenizer files, model weights) to a local directory and upload the whole directory to S3. The KServe HuggingFace runtime handles loading automatically, with no custom handler code required.
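As a concrete sketch of that flow (the checkpoint path, bucket, and prefix below are placeholders), the Transformers save_pretrained API writes everything the runtime needs into one directory, which you then mirror to S3:

import os

import boto3
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder paths: a local fine-tuned checkpoint, an export directory, and an S3 layout
checkpoint = "path/to/finetuned-checkpoint"
local_dir = "export/my-hf-model"
bucket, prefix = "my-models-bucket", "my-hf-model/v1"

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# save_pretrained writes config.json, the tokenizer files, and the weights into one directory
model.save_pretrained(local_dir)
tokenizer.save_pretrained(local_dir)

# Mirror every file in the directory under the same S3 prefix
s3 = boto3.client("s3")
for fname in os.listdir(local_dir):
    s3.upload_file(os.path.join(local_dir, fname), bucket, f"{prefix}/{fname}")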
Creating an InferenceService
KServe’s core abstraction is the InferenceService custom resource. You define your model deployment as a YAML manifest and apply it to Kubernetes — KServe handles creating the underlying pods, services, and ingress routes:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-pytorch-model
  namespace: ml-serving
spec:
  predictor:
    pytorch:
      storageUri: s3://my-models-bucket/my-model/v1
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
        limits:
          cpu: "4"
          memory: 8Gi
          nvidia.com/gpu: "1"
      runtimeVersion: "0.9.0"
kubectl apply -f inference-service.yaml
kubectl get inferenceservice my-pytorch-model -n ml-serving
# Wait for READY = True (may take 2-5 minutes on first deploy as image is pulled)
Once ready, KServe exposes the model at a predictable URL following the pattern: http://<service-name>.<namespace>.<ingress-domain>/v2/models/<model-name>/infer. The v2 inference protocol uses a standard JSON schema for requests and responses, making it easy to switch between frameworks without changing client code.
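Before pointing clients at it, you can confirm the deployment is healthy from the v2 protocol's own routes: readiness endpoints and a model metadata route that reports the declared inputs and outputs. A quick sketch (the hostname is a placeholder for your ingress domain):

import requests

BASE = "http://my-pytorch-model.ml-serving.example.com"
MODEL = "my-pytorch-model"

# Server- and model-level readiness (v2 protocol health routes)
print(requests.get(f"{BASE}/v2/health/ready", timeout=10).status_code)
print(requests.get(f"{BASE}/v2/models/{MODEL}/ready", timeout=10).status_code)

# Model metadata: declared input/output names, datatypes, and shapes
print(requests.get(f"{BASE}/v2/models/{MODEL}", timeout=10).json())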
Sending Inference Requests
import json

import numpy as np
import requests

ENDPOINT = "http://my-pytorch-model.ml-serving.example.com/v2/models/my-pytorch-model/infer"

# V2 inference protocol request: tensors are sent as flattened lists with an explicit shape
payload = {
    "inputs": [{
        "name": "input-0",
        "shape": [1, 3, 224, 224],
        "datatype": "FP32",
        "data": np.random.randn(1, 3, 224, 224).flatten().tolist(),
    }]
}

response = requests.post(
    ENDPOINT,
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload),
    timeout=30,
)
response.raise_for_status()
result = response.json()
print(result["outputs"][0]["data"][:5])
For low-latency production use, avoid JSON serialization for large tensors — use the binary data protocol extension of v2, which serializes tensor data as raw bytes and is 5–10x faster for large inputs. The KServe Python client library (pip install kserve) handles binary serialization automatically.
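For reference, the binary extension works by sending the usual JSON request body first, with a per-input binary_data_size parameter, followed immediately by the raw tensor bytes, plus an Inference-Header-Content-Length header that tells the server where the JSON ends. The sketch below follows that layout as documented for Triton-style v2 servers; treat the header and parameter names as assumptions to verify against your runtime before relying on them:

import json

import numpy as np
import requests

ENDPOINT = "http://my-pytorch-model.ml-serving.example.com/v2/models/my-pytorch-model/infer"

tensor = np.random.randn(1, 3, 224, 224).astype(np.float32)
raw = tensor.tobytes()

# The JSON header describes the tensor; the raw bytes travel immediately after the JSON
header = json.dumps({
    "inputs": [{
        "name": "input-0",
        "shape": list(tensor.shape),
        "datatype": "FP32",
        "parameters": {"binary_data_size": len(raw)},
    }]
}).encode("utf-8")

response = requests.post(
    ENDPOINT,
    headers={
        # Content type per the binary extension; adjust if your runtime expects something else
        "Content-Type": "application/octet-stream",
        "Inference-Header-Content-Length": str(len(header)),
    },
    data=header + raw,
    timeout=30,
)
response.raise_for_status()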
Autoscaling Configuration
KServe with Knative scales pods based on request concurrency (the number of in-flight requests per pod). The default target concurrency is 1, meaning one pod is created per concurrent request — too aggressive for most ML models. Set it based on your model’s measured throughput and acceptable latency under load:
spec:
  predictor:
    minReplicas: 1    # keep at least 1 pod warm (set to 0 for true scale-to-zero)
    maxReplicas: 10   # hard cap to control costs
    scaleTarget: 5    # target concurrent requests per pod
    scaleMetric: concurrency
    pytorch:
      storageUri: s3://my-models-bucket/my-model/v1
      resources:
        limits:
          nvidia.com/gpu: "1"
Scale-to-zero (minReplicas: 0) eliminates cost when there’s no traffic but introduces cold-start latency — typically 30–90 seconds for GPU pods as the node comes up, the image is pulled, and the model is loaded. For latency-sensitive APIs, keep minReplicas: 1. For batch or async workloads where occasional cold starts are acceptable, scale-to-zero meaningfully reduces idle GPU costs.
Canary Deployments
KServe’s canary routing splits traffic between two model versions without needing a separate load balancer configuration:
spec:
  predictor:
    canaryTrafficPercent: 10   # 10% of traffic to the new version
    pytorch:
      storageUri: s3://my-models-bucket/my-model/v2   # new version
      resources:
        limits:
          nvidia.com/gpu: "1"
The previous version (v1) continues serving 90% of traffic while v2 handles 10%. Monitor error rates, latency, and prediction quality metrics for v2 before increasing canaryTrafficPercent to 50, then 100. When you’re satisfied, remove canaryTrafficPercent entirely to promote v2 to full traffic. This pattern lets you validate model quality in production on real traffic without full exposure — particularly important for models where offline evaluation metrics don’t perfectly predict production behavior.
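If you prefer to script the rollout rather than edit YAML by hand, the same field can be patched through the Kubernetes API. A sketch using the official kubernetes Python client, assuming the v1beta1 InferenceService schema and the names used earlier in this guide:

from typing import Optional

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
api = client.CustomObjectsApi()

def set_canary(name: str, namespace: str, percent: Optional[int]) -> None:
    # The custom objects API applies this as a JSON merge patch, so a null value
    # removes canaryTrafficPercent entirely, which promotes the canary to all traffic
    body = {"spec": {"predictor": {"canaryTrafficPercent": percent}}}
    api.patch_namespaced_custom_object(
        group="serving.kserve.io",
        version="v1beta1",
        namespace=namespace,
        plural="inferenceservices",
        name=name,
        body=body,
    )

set_canary("my-pytorch-model", "ml-serving", 50)    # widen the canary to 50%
set_canary("my-pytorch-model", "ml-serving", None)  # promote the new version fully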
Hugging Face Models on KServe
For LLMs and transformer models from the Hugging Face Hub, KServe’s HuggingFace runtime simplifies deployment significantly — no custom handler code, no TorchScript conversion, just point at the model:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-endpoint
  namespace: ml-serving
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=llm-endpoint
        - --model_id=meta-llama/Meta-Llama-3-8B-Instruct
        - --max_length=2048
        - --dtype=bfloat16
      resources:
        limits:
          nvidia.com/gpu: "2"
          memory: 80Gi
The HuggingFace runtime downloads the model from the Hub on first startup (cache it in a PVC for faster subsequent starts), loads it with the specified dtype, and exposes a chat completion-compatible endpoint. For production LLM serving with high throughput requirements, consider the vLLM runtime instead — KServe has a native vLLM integration that enables continuous batching and PagedAttention for much higher token throughput than the default HuggingFace runtime.
When to Use KServe vs Simpler Alternatives
KServe’s value proposition is real but it comes with genuine operational overhead. The Kubernetes + Knative + KServe stack has significant moving parts, requires cluster expertise to operate, and adds 1–3 hours of setup time even for experienced users. For small teams deploying one or two models, a well-configured FastAPI container on a single GPU instance behind an AWS Application Load Balancer is simpler, costs less to operate, and handles hundreds of requests per second without the orchestration layer. KServe pays off when you’re managing many models across teams (the standardized InferenceService API makes model deployment self-service), need canary deployments and model versioning as first-class features, or require scale-to-zero for cost control across a large fleet of infrequently-used models. The inflection point for most organizations is roughly 10+ models in production, 3+ ML engineers managing deployments, and a need for consistent deployment practices across teams — below that threshold, the simpler path usually wins on total engineering cost.
Resource Management for GPU Workloads
GPU resource management in KServe requires understanding how Kubernetes handles GPU requests. Unlike CPU and memory, GPUs are not divisible by default — a request for nvidia.com/gpu: “1” allocates an entire physical GPU to that pod. This means a model that only uses 20% of a GPU’s compute capacity still occupies the full device, which is wasteful at scale. NVIDIA’s MIG (Multi-Instance GPU) feature on A100 and H100 GPUs lets you partition a physical GPU into isolated slices, and KServe supports MIG profiles through the device plugin. For inference workloads where many small models share a GPU fleet, MIG can significantly improve utilization — you can run 7 separate 1g.10gb MIG slices on a single A100 80GB, each getting dedicated compute and memory resources.
For GPU time-sharing (multiplexing without MIG), the NVIDIA GPU Operator supports time-slicing where multiple pods share a single GPU. Time-slicing has no memory isolation between pods, so a misbehaving model can OOM and affect co-located models. Use MIG where possible for production workloads; use time-slicing only for development environments or models with well-understood, stable memory footprints. For GPU pods, either specify the GPU only under limits or set requests equal to limits: Kubernetes requires requests and limits to match for extended resources, rejects manifests where they differ, and defaults the request to the limit when only the limit is given.
Memory requests and limits for the container should account for both the GPU memory (which Kubernetes doesn’t directly manage — it’s allocated by the CUDA runtime when the model loads) and the host CPU memory (which Kubernetes does manage). For a 7B parameter model in bfloat16, the weights alone occupy ~14GB of GPU memory, but host memory includes the process overhead, any CPU preprocessing, and the serving framework’s buffers. A conservative rule of thumb is to set the container memory limit to at least 2x the model’s serialized size to account for framework overhead and transient allocations during request handling.
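To make the arithmetic concrete, here is a tiny sizing helper that encodes the rule of thumb above; the 2x overhead factor is the same assumption stated in the text, not a KServe requirement:

# Rough memory sizing for a dense model: weights only, excluding KV cache and activations
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}

def weight_memory_gb(n_params: float, dtype: str) -> float:
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

def container_memory_limit_gb(n_params: float, dtype: str, overhead_factor: float = 2.0) -> float:
    # Conservative host-memory limit: ~2x the serialized model size for framework overhead
    return overhead_factor * weight_memory_gb(n_params, dtype)

print(weight_memory_gb(7e9, "bf16"))            # ~14 GB of weights for a 7B bf16 model
print(container_memory_limit_gb(7e9, "bf16"))   # ~28 GB container memory limit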
Logging, Tracing, and Observability
KServe integrates with standard Kubernetes observability tooling. Prometheus metrics are exposed automatically on port 8080 at /metrics — they include request count, latency histograms, and error rates for each InferenceService. Grafana dashboards for KServe are available in the kserve/kserve repository and can be imported directly. For distributed tracing, KServe supports Jaeger and Zipkin through the OpenTelemetry collector — adding OTEL annotations to your InferenceService manifest enables trace propagation through the serving pipeline.
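If you want to check latency programmatically rather than in Grafana, you can query the Prometheus HTTP API directly. The sketch below assumes Knative's revision_request_latencies histogram is being scraped; the metric and label names vary with your monitoring setup, so verify them against your Prometheus before depending on the query:

import requests

PROMETHEUS = "http://prometheus.monitoring.svc.cluster.local:9090"  # placeholder address

# p95 request latency over the last 5 minutes; the metric name is an assumption based on
# Knative's revision_request_latencies histogram being available in your Prometheus
query = (
    'histogram_quantile(0.95, sum(rate('
    'revision_request_latencies_bucket{namespace_name="ml-serving"}[5m])) by (le))'
)

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query}, timeout=10)
print(resp.json()["data"]["result"])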
Request logging — capturing input features and predictions for post-hoc analysis and retraining — is handled through KServe’s data plane logging feature. You configure a CloudEvent destination (an S3 bucket, a Kafka topic, or a generic HTTP endpoint) in the InferenceService spec, and KServe logs all requests and responses as CloudEvents to that destination asynchronously without adding latency to the inference path. This is the standard pattern for building retraining pipelines that are triggered by prediction quality degradation in production: log predictions, run offline evaluation against ground truth labels as they arrive, and trigger retraining when quality metrics cross a threshold.
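On the receiving end, the logged events arrive as plain HTTP POSTs carrying CloudEvents metadata in ce-* headers. Here is a minimal sketch of a sink you could point the logger at; the storage choice (stdout here) and port are illustrative, not anything KServe mandates:

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class LogSink(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        event = {
            # Standard CloudEvents metadata travels in ce-* headers (binary content mode);
            # KServe adds further extension attributes, so inspect the headers for your version
            "id": self.headers.get("ce-id"),
            "type": self.headers.get("ce-type"),      # distinguishes request vs response events
            "source": self.headers.get("ce-source"),
            "payload": json.loads(body or b"{}"),
        }
        # In production, append to durable storage (S3, Kafka) instead of stdout
        print(json.dumps(event))
        self.send_response(202)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), LogSink).serve_forever()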
Model explainability is a first-class concept in KServe’s architecture. You can attach an explainer component to an InferenceService alongside the predictor — KServe ships built-in support for SHAP and Alibi as explainer backends. When an explainability request comes in (via the /explain endpoint instead of /infer), KServe routes it to the explainer, which queries the predictor and computes feature attributions. This architecture keeps explainability computation out of the hot inference path while making it accessible through the same endpoint abstraction.
Handling Model Dependencies and Custom Runtimes
Not every model fits the standard KServe runtimes. If you have a custom preprocessing pipeline, a non-standard model format, or framework dependencies that don’t match any built-in runtime, you can write a custom KServe model server by subclassing kserve.Model and implementing the load and predict methods. This gives you full control over initialization and inference while keeping the KServe InferenceService abstraction for deployment, scaling, and traffic management:
from typing import Any, Dict

import kserve
import torch

class MyCustomModel(kserve.Model):
    def __init__(self, name: str):
        super().__init__(name)
        self.model = None
        self.ready = False

    def load(self):
        # Called once at startup; KServe marks the pod ready only after this completes
        self.model = torch.load("/mnt/models/model.pt", map_location="cuda")
        self.model.eval()
        self.ready = True

    async def predict(self, payload: Dict[str, Any], headers: Dict[str, str] = None) -> Dict:
        # Expect a v2-protocol request body; take the first input tensor
        inputs = payload["inputs"][0]["data"]
        tensor = torch.tensor(inputs).cuda().reshape(1, -1)
        with torch.no_grad():
            output = self.model(tensor)
        # Return a v2-protocol response with the tensor flattened into a single list
        return {"outputs": [{"name": "output", "datatype": "FP32",
                             "shape": list(output.shape),
                             "data": output.cpu().flatten().tolist()}]}

if __name__ == "__main__":
    model = MyCustomModel("my-custom-model")
    model.load()
    kserve.ModelServer().start([model])
Package this as a Docker image, push to a container registry, and reference it in an InferenceService spec with a custom container spec instead of one of the built-in runtime types. The load method runs synchronously at startup, so KServe won’t route traffic to the pod until it returns — use this to ensure the model is fully warmed up before serving begins.
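Before building the image, it's worth exercising the class locally with the same payload shape the server will receive, as a sanity check that load and predict behave as expected. A sketch (the dummy input shape and the local model path are assumptions to adapt):

import asyncio

# Reuses the MyCustomModel class defined above; assumes the model file and a GPU are
# available locally, so adjust the path and .cuda() calls if you test on CPU
model = MyCustomModel("my-custom-model")
model.load()

payload = {
    "inputs": [{
        "name": "input-0",
        "shape": [1, 16],
        "datatype": "FP32",
        "data": [0.0] * 16,   # dummy feature vector; replace with a realistic example
    }]
}

result = asyncio.run(model.predict(payload))
print(result["outputs"][0]["shape"], result["outputs"][0]["data"][:5])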
Production Checklist Before Going Live
Before routing production traffic to a KServe endpoint, work through these checks. First, validate that the model output matches the output of your offline evaluation setup on a representative set of inputs — subtle differences in preprocessing (tokenization order, normalization constants, input shape expectations) can cause the serving model to produce different results from the model you evaluated. Second, load test the endpoint at your expected peak traffic level plus a 50% safety margin, measuring p50, p95, and p99 latency under load. KServe’s autoscaling helps absorb traffic spikes, but there’s always a lag between the autoscaler detecting high concurrency and new pods becoming ready — size your minReplicas so that the existing pods can handle expected peak traffic without waiting for scale-out. Third, verify that your health checks are correct — KServe uses Kubernetes readiness and liveness probes, and a misconfigured readiness probe that marks pods ready before the model is loaded leads to failed requests during pod startup. The kserve.Model base class handles readiness automatically when you set self.ready = True in your load method, but custom containers need to implement the /health/ready endpoint themselves. Finally, test your rollback path: deploy a canary, verify it, then practice rolling back to the previous version by removing the canary spec — the first time you need to roll back under pressure is not the time to learn how it works.
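A short script covers the first three checks in one pass: a readiness probe, a burst of requests at fixed concurrency, and latency percentiles. The endpoint and payload below are placeholders, and a real load test deserves a proper tool, but this catches gross misconfiguration quickly:

import concurrent.futures
import time

import numpy as np
import requests

ENDPOINT = "http://my-pytorch-model.ml-serving.example.com/v2/models/my-pytorch-model/infer"
READY = "http://my-pytorch-model.ml-serving.example.com/v2/models/my-pytorch-model/ready"

payload = {"inputs": [{"name": "input-0", "shape": [1, 3, 224, 224], "datatype": "FP32",
                       "data": np.random.randn(1, 3, 224, 224).flatten().tolist()}]}

# Readiness check before generating any load
assert requests.get(READY, timeout=10).status_code == 200, "model not ready"

def one_request() -> float:
    start = time.perf_counter()
    r = requests.post(ENDPOINT, json=payload, timeout=60)
    r.raise_for_status()
    return time.perf_counter() - start

# 200 requests at a concurrency of 10; adjust both to your expected peak plus margin
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    latencies = list(pool.map(lambda _: one_request(), range(200)))

for q in (50, 95, 99):
    print(f"p{q}: {np.percentile(latencies, q) * 1000:.1f} ms")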
Networking configuration is a frequent source of production issues. KServe endpoints behind a corporate firewall or VPC may need specific ingress annotations for your load balancer (ALB on AWS, GCE L7 on GKE) to handle gRPC connections correctly — the v2 inference protocol supports both REST and gRPC, and gRPC has different connection requirements than HTTP/1.1. Set appropriate connection timeouts for your model’s inference latency — KServe defaults are designed for fast models and will time out large language model requests that take several seconds to generate a response. For LLM streaming endpoints, configure the ingress to support HTTP/2 and server-sent events, since streaming responses require connection persistence that standard HTTP/1.1 load balancers may terminate prematurely.