When you’re scaling AI and machine learning workloads beyond a single machine, the complexity of distributed computing quickly becomes overwhelming. Managing distributed training across multiple GPUs, coordinating hyperparameter tuning experiments, serving models at scale, and orchestrating data preprocessing pipelines all require sophisticated infrastructure. Ray and Kubernetes have emerged as a leading combination for AI workload orchestration: Ray provides a distributed computing framework designed for ML workloads, while Kubernetes handles the underlying infrastructure orchestration. Together, they enable you to build scalable AI platforms that use resources efficiently, recover from failures automatically, and adapt to changing workload demands. Understanding how to leverage both technologies together is essential for ML engineering teams building production-grade AI systems.
Why Ray and Kubernetes Complement Each Other
Before diving into implementation patterns, it’s crucial to understand what each technology provides and why using them together creates a more powerful platform than either alone.
What Ray Brings to AI Workloads
Ray is a distributed computing framework designed from the ground up for AI and ML workloads. It provides several capabilities that make distributed ML dramatically easier:
Ray’s core abstraction is the task—a Python function that can execute on any available node in the cluster. Unlike traditional distributed computing frameworks, which require complex setup, Ray turns an ordinary function into a distributed task with a single decorator. Ray handles scheduling these tasks across your cluster, managing data transfer between nodes, and recovering from failures.
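A minimal sketch of that pattern (the function and data here are illustrative):

import ray

ray.init()  # connect to an existing cluster, or start a local one

@ray.remote
def preprocess_shard(shard):
    # An ordinary Python function becomes a distributed task via the
    # decorator; Ray schedules it on any node with free resources
    return [value * 2 for value in shard]

# Each .remote() call returns a future (ObjectRef) immediately
futures = [preprocess_shard.remote(list(range(i, i + 10))) for i in range(0, 50, 10)]
results = ray.get(futures)  # block until all five tasks finish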
Ray includes specialized libraries for common ML patterns: Ray Train for distributed training (supporting PyTorch, TensorFlow, XGBoost), Ray Tune for hyperparameter optimization, Ray Serve for model serving, and Ray Data for scalable data processing. These libraries understand ML workload characteristics—they know about GPU placement, gradient synchronization, model checkpointing, and distributed data loading.
Resource management in Ray is ML-aware. You can specify that a task needs 2 GPUs and 16 CPUs, and Ray schedules it only on nodes with available resources. Ray understands fractional GPU sharing, gang scheduling (allocating resources for all workers simultaneously), and placement groups (colocating related tasks).
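A short sketch of how those requirements are expressed (the function bodies are placeholders for user-supplied logic):

import ray

# Ray schedules this task only on a node with 2 free GPUs and 16 free CPUs
@ray.remote(num_gpus=2, num_cpus=16)
def train_shard(shard):
    return fit_model(shard)  # fit_model is a user-supplied helper

# Fractional GPUs let several lightweight tasks share a single device
@ray.remote(num_gpus=0.25)
def embed(batch):
    return encode(batch)  # encode is a user-supplied helper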
What Kubernetes Provides
Kubernetes excels at infrastructure orchestration. It manages the lifecycle of containers, handles node failures by rescheduling pods, provides load balancing and service discovery, and offers declarative configuration for desired cluster state.
For Ray workloads, Kubernetes provides:
- Automatic scaling of Ray cluster size based on workload demands
- Health monitoring and automatic restart of failed Ray nodes
- Isolation between different teams’ Ray clusters through namespaces
- Integration with cloud provider features (storage, networking, IAM)
- Standardized deployment patterns across different environments
The Synergy
Ray provides the application-level orchestration—deciding which tasks run where, managing distributed state, and coordinating ML-specific patterns. Kubernetes provides infrastructure-level orchestration—ensuring containers are running, scaling resources up and down, and managing failures at the node level.
This separation of concerns is powerful. Ray doesn’t need to implement container lifecycle management, node provisioning, or cloud provider integration. Kubernetes doesn’t need to understand gradient synchronization, hyperparameter search spaces, or model serving patterns. Each does what it does best.
Ray + Kubernetes: Division of Responsibilities
Ray Handles: Task scheduling, distributed state, ML-specific patterns (training, tuning, serving), resource allocation within cluster
Kubernetes Handles: Container lifecycle, node management, auto-scaling, networking, storage, failure recovery
Together They Provide: Elastic ML platform that scales resources dynamically, recovers from failures automatically, and efficiently utilizes GPUs
Deploying Ray on Kubernetes: Architecture and Patterns
The KubeRay operator is the standard way to deploy and manage Ray clusters on Kubernetes. Understanding its architecture and deployment patterns is essential for building reliable AI platforms.
KubeRay Operator Architecture
KubeRay introduces custom Kubernetes resources (CRDs) for Ray clusters. When you create a RayCluster resource, the KubeRay operator watches for this resource and creates the necessary Kubernetes pods, services, and configurations to run a Ray cluster.
A Ray cluster consists of:
- Head Node: Single pod running the Ray head process, which manages cluster state, schedules tasks, and provides the entry point for submitting jobs
- Worker Nodes: Multiple pods running Ray worker processes, which execute tasks and hold distributed state
- Services: Kubernetes services enabling communication between Ray components and exposing endpoints for job submission
The KubeRay operator continuously reconciles desired state (defined in your RayCluster resource) with actual state, creating new pods when workers are needed and removing them when not.
Basic RayCluster Configuration
Here’s a foundational RayCluster configuration for ML workloads:
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: ml-training-cluster
  namespace: ml-platform
spec:
  # Ray version
  rayVersion: '2.9.0'
  # Head node configuration
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
      num-cpus: '0'  # Don't schedule tasks on head
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray-ml:2.9.0-py310
            ports:
              - containerPort: 6379   # GCS server
                name: gcs
              - containerPort: 8265   # Dashboard
                name: dashboard
              - containerPort: 10001  # Client
                name: client
            resources:
              requests:
                cpu: "2"
                memory: "8Gi"
              limits:
                cpu: "2"
                memory: "8Gi"
            volumeMounts:
              - name: ray-logs
                mountPath: /tmp/ray
        volumes:
          - name: ray-logs
            emptyDir: {}
  # Worker node groups
  workerGroupSpecs:
    # CPU worker group
    - groupName: cpu-workers
      replicas: 3
      minReplicas: 1
      maxReplicas: 10
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray-ml:2.9.0-py310
              resources:
                requests:
                  cpu: "4"
                  memory: "16Gi"
                limits:
                  cpu: "4"
                  memory: "16Gi"
    # GPU worker group for training
    - groupName: gpu-workers
      replicas: 2
      minReplicas: 0
      maxReplicas: 5
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray-ml:2.9.0-py310-gpu
              resources:
                requests:
                  cpu: "8"
                  memory: "32Gi"
                  nvidia.com/gpu: "1"
                limits:
                  cpu: "8"
                  memory: "32Gi"
                  nvidia.com/gpu: "1"
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-tesla-v100
This configuration creates a Ray cluster with heterogeneous worker types—CPU workers for data processing and GPU workers for training. The minReplicas and maxReplicas enable autoscaling, adding or removing workers based on workload demands.
Autoscaling Configuration
Ray’s autoscaler integrates with Kubernetes to dynamically adjust cluster size:
# Add to RayCluster spec
enableInTreeAutoscaling: true
autoscalerOptions:
  upscalingMode: Default
  idleTimeoutSeconds: 60
The autoscaler scales each worker group within the minReplicas and maxReplicas bounds already declared in workerGroupSpecs; there is no separate per-resource replica section in the autoscaler options.
When Ray tasks request resources that aren’t available, the autoscaler requests additional worker pods from Kubernetes. When workers are idle for the timeout period, they’re scaled down. This elasticity is crucial for cost efficiency—you only pay for resources when actively processing workloads.
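A sketch of that behavior from the client side, assuming the cluster defined above:

import ray

ray.init(address="ray://ml-training-cluster-head:10001")

@ray.remote(num_gpus=1)
def fine_tune(shard_id):
    ...  # training logic (placeholder)
    return shard_id

# Launching 20 one-GPU tasks against a cluster with 2 GPU workers leaves
# tasks pending; the autoscaler sees the unmet demand and asks Kubernetes
# for more worker pods (up to maxReplicas), then removes them after
# idleTimeoutSeconds once the queue drains
results = ray.get([fine_tune.remote(i) for i in range(20)])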
Orchestrating Distributed Training Workloads
Distributed training is one of the most resource-intensive AI workloads, and Ray’s integration with Kubernetes makes it significantly more manageable.
Ray Train for Multi-GPU Training
Ray Train provides a simple API for distributed training that handles the complexity of multi-node, multi-GPU synchronization:
import ray
import torch
import torch.nn as nn
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func(config):
    """Training function that runs on each worker"""
    # Get distributed training context (useful for rank-0-only logging)
    rank = train.get_context().get_world_rank()

    # Create model
    model = nn.Sequential(
        nn.Linear(784, 512),
        nn.ReLU(),
        nn.Linear(512, 10)
    )
    model = train.torch.prepare_model(model)  # Wrap for distributed training

    # Create optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])

    # Training loop
    for epoch in range(config["num_epochs"]):
        # get_dataloader is a user-supplied helper returning a PyTorch
        # DataLoader; prepare_data_loader shards it across workers
        dataloader = get_dataloader(config["batch_size"])
        dataloader = train.torch.prepare_data_loader(dataloader)

        for batch_idx, (data, target) in enumerate(dataloader):
            optimizer.zero_grad()
            output = model(data)
            loss = nn.functional.cross_entropy(output, target)
            loss.backward()
            optimizer.step()

            # Report metrics (aggregated across workers)
            if batch_idx % 100 == 0:
                train.report({"loss": loss.item(), "epoch": epoch})

# Configure distributed training
trainer = TorchTrainer(
    train_func,
    train_loop_config={
        "lr": 0.001,
        "batch_size": 64,
        "num_epochs": 10
    },
    scaling_config=ScalingConfig(
        num_workers=4,  # 4 GPU workers
        use_gpu=True,
        resources_per_worker={"CPU": 4, "GPU": 1}
    )
)

# Connect to the Ray cluster on Kubernetes (head service created by KubeRay;
# adjust the address to your cluster)
ray.init(address="ray://ml-training-cluster-head:10001")

# Run training
result = trainer.fit()
print(f"Training completed. Final loss: {result.metrics['loss']}")
This code runs seamlessly whether on a local machine or a Kubernetes-hosted Ray cluster with 100 GPUs. Ray Train handles:
- Data distribution across workers
- Gradient synchronization between GPUs
- Fault tolerance—if a worker fails, training can continue
- Metric aggregation and reporting
- Checkpoint saving for model recovery (a sketch follows below)
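The checkpointing hook is worth seeing concretely. A fragment for the end of each epoch inside train_func, using Ray 2.x's Checkpoint API (not standalone: model, loss, and epoch come from the training loop above):

import os
import tempfile
import torch
from ray import train
from ray.train import Checkpoint

# Inside train_func, at the end of an epoch:
with tempfile.TemporaryDirectory() as tmpdir:
    torch.save(model.state_dict(), os.path.join(tmpdir, "model.pt"))
    # Attaching a checkpoint to the report lets Ray Train persist it and
    # resume training after a worker or node failure
    train.report(
        {"loss": loss.item(), "epoch": epoch},
        checkpoint=Checkpoint.from_directory(tmpdir),
    )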
Resource Allocation Strategies
Different training strategies require different resource allocation patterns:
Data Parallel Training: Multiple workers train on different data batches with the same model. Allocate 1 GPU per worker, typically 4-8 CPUs per GPU for data loading:
ScalingConfig(
    num_workers=8,
    use_gpu=True,
    resources_per_worker={"CPU": 8, "GPU": 1}
)
Model Parallel Training: Large models split across multiple GPUs. Allocate multiple GPUs per worker:
ScalingConfig(
    num_workers=2,  # 2 model replicas
    use_gpu=True,
    resources_per_worker={"CPU": 16, "GPU": 4}  # Each replica uses 4 GPUs
)
Pipeline Parallel Training: Model stages run on different GPUs. ScalingConfig accepts a placement strategy string (Ray builds the underlying placement group, one bundle per worker); for manual control over colocation, Ray Core exposes placement groups directly:

from ray.util.placement_group import placement_group

# ScalingConfig takes a strategy string, not a placement group object;
# Ray creates the placement group internally
ScalingConfig(
    num_workers=8,
    use_gpu=True,
    placement_strategy="PACK"  # colocate workers where possible
)

# For Ray Core tasks and actors, placement groups can be created manually
pg = placement_group([{"GPU": 4, "CPU": 16}] * 2)  # 2 bundles of 4 GPUs each
ray.get(pg.ready())  # block until all bundles are reserved
Ray’s resource management ensures these allocations are satisfied before starting training, preventing partial allocations that would cause failures.
Hyperparameter Tuning at Scale
Hyperparameter optimization often requires running dozens or hundreds of training trials. Ray Tune provides efficient orchestration for this workload pattern.
Ray Tune Architecture
Ray Tune schedules multiple training trials across your Ray cluster, dynamically allocating resources and implementing sophisticated algorithms like population-based training or Bayesian optimization. On Kubernetes, this means trials automatically scale across available nodes, utilizing GPUs efficiently.
Implementing Scalable Hyperparameter Search
import ray
import torch
from ray import train, tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch  # requires the optuna package

def trainable(config):
    """Single trial training function"""
    # create_model, train_epoch, and validate are user-supplied helpers
    model = create_model(
        hidden_size=config["hidden_size"],
        num_layers=config["num_layers"],
        dropout=config["dropout"]
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])

    # Training loop with metrics reporting
    for epoch in range(10):
        train_loss = train_epoch(model, optimizer)
        val_loss = validate(model)
        # Report intermediate results for early stopping
        train.report({"val_loss": val_loss, "train_loss": train_loss})

# Configure search space
search_space = {
    "lr": tune.loguniform(1e-5, 1e-1),
    "hidden_size": tune.choice([128, 256, 512, 1024]),
    "num_layers": tune.randint(2, 6),
    "dropout": tune.uniform(0.1, 0.5)
}

# ASHA scheduler for early stopping of poor trials
scheduler = ASHAScheduler(
    metric="val_loss",
    mode="min",
    max_t=10,            # Max epochs per trial
    grace_period=1,      # Min epochs before stopping
    reduction_factor=2
)

# Optuna's TPE sampler gives Bayesian-style search over this mixed
# discrete/continuous space (BayesOptSearch supports only continuous params)
search_alg = OptunaSearch(
    metric="val_loss",
    mode="min"
)

# Run tuning across Ray cluster
ray.init(address="ray://ml-training-cluster-head:10001")
analysis = tune.run(
    trainable,
    config=search_space,
    num_samples=100,  # 100 trials
    scheduler=scheduler,
    search_alg=search_alg,
    resources_per_trial={"cpu": 4, "gpu": 1},
    verbose=1
)

# Get best configuration
best_config = analysis.get_best_config(metric="val_loss", mode="min")
print(f"Best hyperparameters: {best_config}")
This search runs 100 trials across your Kubernetes Ray cluster. If you have 10 GPUs available, Ray Tune runs 10 trials concurrently. The ASHA scheduler stops unpromising trials early, freeing resources for new trials. Bayesian optimization focuses search on promising regions of hyperparameter space.
The key advantage on Kubernetes: as your tuning job requests more resources (many concurrent trials), the Ray autoscaler can add worker pods, scaling your cluster to match workload demands. When tuning completes, workers scale down automatically.
Best Practices for Ray + Kubernetes
- Resource Requests = Limits: Set equal values to ensure predictable scheduling and avoid node oversubscription
- Persistent Storage: Use PersistentVolumes for checkpoints and logs to survive pod restarts
- Health Checks: Configure liveness and readiness probes for reliable failure detection
- Network Policies: Isolate Ray clusters with Kubernetes network policies for security
- Resource Quotas: Prevent runaway autoscaling with namespace resource quotas
- Monitoring: Export Ray metrics to Prometheus for visibility into cluster utilization
Serving ML Models with Ray Serve on Kubernetes
Once models are trained, serving them at scale requires orchestration that handles traffic routing, autoscaling, and version management. Ray Serve on Kubernetes provides this capability.
Ray Serve Architecture
Ray Serve is a model serving framework that runs on top of Ray. It provides:
- HTTP/gRPC endpoints for model inference
- Autoscaling based on request load
- Model composition (chaining multiple models)
- A/B testing and canary deployments
- Batching for throughput optimization (a sketch follows below)
On Kubernetes, Ray Serve deployments are Ray clusters configured for serving workloads rather than training.
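The batching feature deserves a concrete look. A sketch using Ray Serve's @serve.batch decorator (load_model is a user-supplied helper, as elsewhere in this article):

from ray import serve

@serve.deployment
class BatchedModel:
    def __init__(self):
        self.model = load_model("model.pt")  # user-supplied loader

    @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.05)
    async def predict_batch(self, inputs):
        # Ray Serve gathers up to 32 concurrent requests into one list so
        # the model can run a single batched forward pass; the return value
        # must be a list with one result per input
        return list(self.model(inputs))

    async def __call__(self, request):
        data = await request.json()
        return await self.predict_batch(data)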
Deploying Models with Ray Serve
import ray
from ray import serve

# Connect to Ray cluster
ray.init(address="ray://serving-cluster-head:10001")

# Start Ray Serve
serve.start(http_options={"host": "0.0.0.0", "port": 8000})

@serve.deployment(
    num_replicas=2,
    ray_actor_options={"num_cpus": 2, "num_gpus": 0.5}
)
class ModelServing:
    def __init__(self):
        # Load model once per replica (load_model and preprocess are
        # user-supplied helpers)
        self.model = load_model("s3://models/production-model.pt")

    async def __call__(self, request):
        data = await request.json()
        inputs = preprocess(data)
        predictions = self.model(inputs)
        return {"predictions": predictions.tolist()}

# Deploy (serve.run replaces the older ModelServing.deploy() call,
# which was removed in the Ray 2.x Serve API)
serve.run(ModelServing.bind())
This deployment creates 2 replicas of your model across the Ray cluster. Each replica uses 0.5 GPUs (GPU sharing for cost efficiency). Ray Serve automatically load balances requests across replicas and scales replicas based on load.
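Querying the endpoint is plain HTTP. A sketch, assuming port 8000 on the Serve proxy is reachable (for example via a Kubernetes Service or kubectl port-forward); the payload shape is illustrative:

import requests

# Reach the Serve HTTP proxy configured above, e.g. after:
#   kubectl port-forward <serving-head-pod> 8000:8000
resp = requests.post(
    "http://localhost:8000/",
    json={"features": [[0.1, 0.4, 0.3]]},  # illustrative payload
)
print(resp.json())  # {"predictions": [...]}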
Autoscaling Serving Deployments
Configure autoscaling to handle variable traffic:
@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        "target_num_ongoing_requests_per_replica": 5
    },
    ray_actor_options={"num_cpus": 2, "num_gpus": 0.5}
)
class ModelServing:
    ...  # implementation as above
When the average number of ongoing requests per replica exceeds 5, Ray Serve adds replicas. When requests drop, replicas scale down. Combined with Kubernetes autoscaling of Ray worker nodes, this provides end-to-end elasticity from HTTP requests to infrastructure.
Multi-Model Serving and Composition
Serve multiple models or compose them into pipelines:
import asyncio
from ray import serve

@serve.deployment
class Preprocessor:
    def __call__(self, data):
        return preprocess(data)  # user-supplied helper

@serve.deployment(ray_actor_options={"num_gpus": 1})
class ModelA:
    def __init__(self):
        self.model = load_model("model_a.pt")

    def __call__(self, data):
        return self.model(data)

@serve.deployment(ray_actor_options={"num_gpus": 1})
class ModelB:
    def __init__(self):
        self.model = load_model("model_b.pt")

    def __call__(self, data):
        return self.model(data)

@serve.deployment
class Ensemble:
    def __init__(self, preprocessor, model_a, model_b):
        # Bound deployments arrive as handles for remote calls
        self.preprocessor = preprocessor
        self.model_a = model_a
        self.model_b = model_b

    async def __call__(self, request):
        data = await request.json()
        # Preprocess once
        processed = await self.preprocessor.remote(data)
        # Run models in parallel
        result_a, result_b = await asyncio.gather(
            self.model_a.remote(processed),
            self.model_b.remote(processed)
        )
        # Combine predictions (ensemble_predictions is user-supplied)
        final = ensemble_predictions(result_a, result_b)
        return {"prediction": final}

# Deploy pipeline
preprocessor = Preprocessor.bind()
model_a = ModelA.bind()
model_b = ModelB.bind()
ensemble = Ensemble.bind(preprocessor, model_a, model_b)
serve.run(ensemble)
This pipeline preprocesses once, runs two models in parallel, and ensembles results—all with automatic load balancing and autoscaling.
Job Submission and Workflow Orchestration
Ray Jobs API provides a way to submit workloads to Ray clusters on Kubernetes, enabling integration with CI/CD pipelines and workflow orchestration tools.
Submitting Jobs to Kubernetes-Hosted Ray Clusters
From outside the cluster, submit jobs via Ray Jobs API:
from ray.job_submission import JobSubmissionClient

# Connect to the Ray cluster via the Kubernetes service for the head pod
client = JobSubmissionClient("http://ml-training-cluster-head:8265")

# Submit training job
job_id = client.submit_job(
    entrypoint="python train.py --epochs 100 --lr 0.001",
    runtime_env={
        "working_dir": "./",
        "pip": ["torch==2.0.0", "transformers==4.30.0"]
    }
)
print(f"Submitted job {job_id}")

# Monitor job status
status = client.get_job_status(job_id)
logs = client.get_job_logs(job_id)
This pattern enables scheduled training (via Kubernetes CronJobs), CI/CD integration (triggering training from GitHub Actions), and workflow orchestration (using Airflow or Argo to coordinate multi-stage ML pipelines).
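In a CI pipeline, the submitting step usually blocks until the job reaches a terminal state. A minimal sketch:

import time
from ray.job_submission import JobSubmissionClient, JobStatus

client = JobSubmissionClient("http://ml-training-cluster-head:8265")
job_id = client.submit_job(entrypoint="python train.py --epochs 100")

# Poll until the job reaches a terminal state
status = client.get_job_status(job_id)
while status not in {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}:
    time.sleep(10)
    status = client.get_job_status(job_id)

print(client.get_job_logs(job_id))
assert status == JobStatus.SUCCEEDED, f"Job ended with status {status}"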
Workflow Patterns
Common workflow patterns on Ray + Kubernetes:
Nightly Model Retraining: CronJob triggers Ray job to retrain model on latest data, validate, and deploy if metrics improve
Feature Engineering Pipeline: Argo Workflow orchestrates Ray jobs for data extraction, feature computation, and feature store updates
Experiment Tracking: CI pipeline submits Ray Tune job for hyperparameter search, logs results to MLflow, and notifies team of best configuration
A/B Test Deployment: GitOps workflow updates Ray Serve deployment to route 10% traffic to new model version, monitors metrics, and promotes or rolls back
These patterns leverage Kubernetes for workflow orchestration while Ray handles the distributed ML computation.
Monitoring and Observability
Production AI platforms require comprehensive monitoring to ensure reliability and efficiency.
Ray Dashboard
Ray provides a built-in dashboard showing:
- Cluster resource utilization (CPU, GPU, memory)
- Active tasks and their resource consumption
- Worker node status and health
- Job submission history and logs
Expose the dashboard via Kubernetes ingress:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ray-dashboard
  namespace: ml-platform
spec:
  rules:
    - host: ray-dashboard.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ml-training-cluster-head
                port:
                  number: 8265
Prometheus Integration
Export Ray metrics to Prometheus for long-term storage and alerting:
# Expose Ray's Prometheus metrics endpoint via rayStartParams in the
# RayCluster spec (head and worker groups)
rayStartParams:
  metrics-export-port: "8080"
# ServiceMonitor for Prometheus scraping (assumes the Ray service carries
# the app: ray-cluster label and names its metrics port "metrics")
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ray-metrics
  namespace: ml-platform
spec:
  selector:
    matchLabels:
      app: ray-cluster
  endpoints:
    - port: metrics
      interval: 30s
Key metrics to monitor:
- GPU Utilization: Ensure expensive GPU resources are fully utilized
- Task Queue Depth: High queues indicate resource constraints
- Worker Failures: Frequent failures suggest infrastructure issues
- Job Duration: Track training time trends for performance regression
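For quick utilization checks outside Prometheus, Ray also exposes aggregate resource state programmatically:

import ray

ray.init(address="ray://ml-training-cluster-head:10001")

# Compare total versus currently free resources to spot idle GPUs
total = ray.cluster_resources()
free = ray.available_resources()
gpu_total = total.get("GPU", 0)
gpu_free = free.get("GPU", 0)
print(f"GPUs in use: {gpu_total - gpu_free:g} of {gpu_total:g}")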
Logging Strategy
Aggregate logs from Ray pods using Kubernetes logging infrastructure:
# FluentBit DaemonSet collects logs
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  filter.conf: |
    [FILTER]
        Name             kubernetes
        Match            kube.*
        Kube_Tag_Prefix  kube.var.log.containers.
    [FILTER]
        Name             grep
        Match            kube.*
        Regex            $kubernetes['labels']['app'] ray-cluster
Forward logs to Elasticsearch, CloudWatch, or your logging platform for analysis and debugging.
Conclusion
Orchestrating AI workloads using Ray and Kubernetes combines the distributed computing capabilities purpose-built for machine learning with the infrastructure management maturity of the cloud-native ecosystem. Ray provides the application-level intelligence to schedule training across GPUs, optimize hyperparameter searches, and serve models efficiently, while Kubernetes handles the infrastructure concerns of container lifecycle, node management, and elastic scaling. This architectural pattern enables ML teams to build production AI platforms that scale from experimentation to hyperscale deployment, handling distributed training, hyperparameter tuning, and model serving through a unified framework with consistent APIs and operational practices.
The patterns covered—heterogeneous worker groups for different workload types, Ray Train for distributed training with automatic gradient synchronization, Ray Tune for scalable hyperparameter optimization, Ray Serve for elastic model serving, and job submission for workflow integration—provide a comprehensive toolkit for modern ML engineering. As AI workloads continue growing in scale and complexity, the Ray and Kubernetes combination will remain essential infrastructure, enabling organizations to efficiently utilize expensive GPU resources, accelerate model development through parallelization, and deliver AI applications that scale reliably. Mastering this orchestration stack is increasingly fundamental to building competitive AI products.