When you’re scaling AI and machine learning workloads beyond a single machine, the complexity of distributed computing quickly becomes overwhelming. Managing distributed training across multiple GPUs, coordinating hyperparameter tuning experiments, serving models at scale, and orchestrating data preprocessing pipelines all require sophisticated infrastructure. Ray and Kubernetes have emerged as a leading combination for AI workload orchestration: Ray provides a distributed computing framework designed for ML workloads, while Kubernetes handles the underlying infrastructure orchestration. Together, they enable you to build scalable AI platforms that use resources efficiently, recover from failures automatically, and adapt to changing workload demands. Understanding how to leverage both technologies together is essential for ML engineering teams building production-grade AI systems.
Why Ray and Kubernetes Complement Each Other
Before diving into implementation patterns, it’s crucial to understand what each technology provides and why using them together creates a more powerful platform than either alone.
What Ray Brings to AI Workloads
Ray is a distributed computing framework designed from the ground up for AI and ML workloads. It provides several capabilities that make distributed ML dramatically easier:
Ray’s core abstraction is the task—a Python function that can execute on any available node in the cluster. Unlike traditional distributed computing frameworks, which require complex setup, Ray turns an ordinary function into a distributed task with a single decorator. Ray handles scheduling these tasks across your cluster, managing data transfer between nodes, and recovering from failures.
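A minimal sketch of that pattern (the function and data here are illustrative):

import ray

ray.init()  # connect to an existing cluster, or start a local one

@ray.remote
def preprocess_shard(shard):
    # An ordinary Python function becomes a distributed task via the
    # decorator; Ray schedules it on any node with free resources
    return [value * 2 for value in shard]

# Each .remote() call returns a future (ObjectRef) immediately
futures = [preprocess_shard.remote(list(range(i, i + 10))) for i in range(0, 50, 10)]
results = ray.get(futures)  # block until all five tasks finish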
Ray includes specialized libraries for common ML patterns: Ray Train for distributed training (supporting PyTorch, TensorFlow, XGBoost), Ray Tune for hyperparameter optimization, Ray Serve for model serving, and Ray Data for scalable data processing. These libraries understand ML workload characteristics—they know about GPU placement, gradient synchronization, model checkpointing, and distributed data loading.
Resource management in Ray is ML-aware. You can specify that a task needs 2 GPUs and 16 CPUs, and Ray schedules it only on nodes with available resources. Ray understands fractional GPU sharing, gang scheduling (allocating resources for all workers simultaneously), and placement groups (colocating related tasks).
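A short sketch of how those requirements are expressed (the function bodies are placeholders for user-supplied logic):

import ray

# Ray schedules this task only on a node with 2 free GPUs and 16 free CPUs
@ray.remote(num_gpus=2, num_cpus=16)
def train_shard(shard):
    return fit_model(shard)  # fit_model is a user-supplied helper

# Fractional GPUs let several lightweight tasks share a single device
@ray.remote(num_gpus=0.25)
def embed(batch):
    return encode(batch)  # encode is a user-supplied helper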
What Kubernetes Provides
Kubernetes excels at infrastructure orchestration. It manages the lifecycle of containers, handles node failures by rescheduling pods, provides load balancing and service discovery, and offers declarative configuration for desired cluster state.
For Ray workloads, Kubernetes provides:
- Automatic scaling of Ray cluster size based on workload demands
- Health monitoring and automatic restart of failed Ray nodes
- Isolation between different teams’ Ray clusters through namespaces
- Integration with cloud provider features (storage, networking, IAM)
- Standardized deployment patterns across different environments
The Synergy
Ray provides the application-level orchestration—deciding which tasks run where, managing distributed state, and coordinating ML-specific patterns. Kubernetes provides infrastructure-level orchestration—ensuring containers are running, scaling resources up and down, and managing failures at the node level.
This separation of concerns is powerful. Ray doesn’t need to implement container lifecycle management, node provisioning, or cloud provider integration. Kubernetes doesn’t need to understand gradient synchronization, hyperparameter search spaces, or model serving patterns. Each does what it does best.
Ray + Kubernetes: Division of Responsibilities
Ray Handles: Task scheduling, distributed state, ML-specific patterns (training, tuning, serving), resource allocation within cluster
Kubernetes Handles: Container lifecycle, node management, auto-scaling, networking, storage, failure recovery
Together They Provide: Elastic ML platform that scales resources dynamically, recovers from failures automatically, and efficiently utilizes GPUs
Deploying Ray on Kubernetes: Architecture and Patterns
The KubeRay operator is the standard way to deploy and manage Ray clusters on Kubernetes. Understanding its architecture and deployment patterns is essential for building reliable AI platforms.
KubeRay Operator Architecture
KubeRay introduces custom Kubernetes resources (CRDs) for Ray clusters. When you create a RayCluster resource, the KubeRay operator watches for this resource and creates the necessary Kubernetes pods, services, and configurations to run a Ray cluster.
A Ray cluster consists of:
- Head Node: Single pod running the Ray head process, which manages cluster state, schedules tasks, and provides the entry point for submitting jobs
- Worker Nodes: Multiple pods running Ray worker processes, which execute tasks and hold distributed state
- Services: Kubernetes services enabling communication between Ray components and exposing endpoints for job submission
The KubeRay operator continuously reconciles desired state (defined in your RayCluster resource) with actual state, creating new pods when workers are needed and removing them when not.
Basic RayCluster Configuration
Here’s a foundational RayCluster configuration for ML workloads:
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: ml-training-cluster
  namespace: ml-platform
spec:
  # Ray version
  rayVersion: '2.9.0'
  # Head node configuration
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
      num-cpus: '0'  # Don't schedule tasks on head
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray-ml:2.9.0-py310
            ports:
              - containerPort: 6379   # GCS server
                name: gcs
              - containerPort: 8265   # Dashboard
                name: dashboard
              - containerPort: 10001  # Client
                name: client
            resources:
              requests:
                cpu: "2"
                memory: "8Gi"
              limits:
                cpu: "2"
                memory: "8Gi"
            volumeMounts:
              - name: ray-logs
                mountPath: /tmp/ray
        volumes:
          - name: ray-logs
            emptyDir: {}
  # Worker node groups
  workerGroupSpecs:
    # CPU worker group
    - groupName: cpu-workers
      replicas: 3
      minReplicas: 1
      maxReplicas: 10
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray-ml:2.9.0-py310
              resources:
                requests:
                  cpu: "4"
                  memory: "16Gi"
                limits:
                  cpu: "4"
                  memory: "16Gi"
    # GPU worker group for training
    - groupName: gpu-workers
      replicas: 2
      minReplicas: 0
      maxReplicas: 5
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray-ml:2.9.0-py310-gpu
              resources:
                requests:
                  cpu: "8"
                  memory: "32Gi"
                  nvidia.com/gpu: "1"
                limits:
                  cpu: "8"
                  memory: "32Gi"
                  nvidia.com/gpu: "1"
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-tesla-v100
This configuration creates a Ray cluster with heterogeneous worker types—CPU workers for data processing and GPU workers for training. The minReplicas and maxReplicas enable autoscaling, adding or removing workers based on workload demands.
Autoscaling Configuration
Ray’s autoscaler integrates with Kubernetes to dynamically adjust cluster size:
# Add to RayCluster spec
enableInTreeAutoscaling: true
autoscalerOptions:
  upscalingMode: Default
  idleTimeoutSeconds: 60
The autoscaler scales each worker group within the minReplicas and maxReplicas bounds already declared in workerGroupSpecs; there is no separate per-resource replica section in the autoscaler options.
When Ray tasks request resources that aren’t available, the autoscaler requests additional worker pods from Kubernetes. When workers are idle for the timeout period, they’re scaled down. This elasticity is crucial for cost efficiency—you only pay for resources when actively processing workloads.
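A sketch of that behavior from the client side, assuming the cluster defined above:

import ray

ray.init(address="ray://ml-training-cluster-head:10001")

@ray.remote(num_gpus=1)
def fine_tune(shard_id):
    ...  # training logic (placeholder)
    return shard_id

# Launching 20 one-GPU tasks against a cluster with 2 GPU workers leaves
# tasks pending; the autoscaler sees the unmet demand and asks Kubernetes
# for more worker pods (up to maxReplicas), then removes them after
# idleTimeoutSeconds once the queue drains
results = ray.get([fine_tune.remote(i) for i in range(20)])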
Orchestrating Distributed Training Workloads
Distributed training is one of the most resource-intensive AI workloads, and Ray’s integration with Kubernetes makes it significantly more manageable.
Ray Train for Multi-GPU Training
Ray Train provides a simple API for distributed training that handles the complexity of multi-node, multi-GPU synchronization:
import ray
import torch
import torch.nn as nn
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func(config):
    """Training function that runs on each worker"""
    # Get distributed training context (useful for rank-0-only logging)
    rank = train.get_context().get_world_rank()

    # Create model
    model = nn.Sequential(
        nn.Linear(784, 512),
        nn.ReLU(),
        nn.Linear(512, 10)
    )
    model = train.torch.prepare_model(model)  # Wrap for distributed training

    # Create optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])

    # Training loop
    for epoch in range(config["num_epochs"]):
        # get_dataloader is a user-supplied helper returning a PyTorch
        # DataLoader; prepare_data_loader shards it across workers
        dataloader = get_dataloader(config["batch_size"])
        dataloader = train.torch.prepare_data_loader(dataloader)

        for batch_idx, (data, target) in enumerate(dataloader):
            optimizer.zero_grad()
            output = model(data)
            loss = nn.functional.cross_entropy(output, target)
            loss.backward()
            optimizer.step()

            # Report metrics (aggregated across workers)
            if batch_idx % 100 == 0:
                train.report({"loss": loss.item(), "epoch": epoch})

# Configure distributed training
trainer = TorchTrainer(
    train_func,
    train_loop_config={
        "lr": 0.001,
        "batch_size": 64,
        "num_epochs": 10
    },
    scaling_config=ScalingConfig(
        num_workers=4,  # 4 GPU workers
        use_gpu=True,
        resources_per_worker={"CPU": 4, "GPU": 1}
    )
)

# Connect to the Ray cluster on Kubernetes (head service created by KubeRay;
# adjust the address to your cluster)
ray.init(address="ray://ml-training-cluster-head:10001")

# Run training
result = trainer.fit()
print(f"Training completed. Final loss: {result.metrics['loss']}")
This code runs seamlessly whether on a local machine or a Kubernetes-hosted Ray cluster with 100 GPUs. Ray Train handles:
- Data distribution across workers
- Gradient synchronization between GPUs
- Fault tolerance—if a worker fails, training can continue
- Metric aggregation and reporting
- Checkpoint saving for model recovery (a sketch follows below)
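The checkpointing hook is worth seeing concretely. A fragment for the end of each epoch inside train_func, using Ray 2.x's Checkpoint API (not standalone: model, loss, and epoch come from the training loop above):

import os
import tempfile
import torch
from ray import train
from ray.train import Checkpoint

# Inside train_func, at the end of an epoch:
with tempfile.TemporaryDirectory() as tmpdir:
    torch.save(model.state_dict(), os.path.join(tmpdir, "model.pt"))
    # Attaching a checkpoint to the report lets Ray Train persist it and
    # resume training after a worker or node failure
    train.report(
        {"loss": loss.item(), "epoch": epoch},
        checkpoint=Checkpoint.from_directory(tmpdir),
    )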
Resource Allocation Strategies
Different training strategies require different resource allocation patterns:
Data Parallel Training: Multiple workers train on different data batches with the same model. Allocate 1 GPU per worker, typically 4-8 CPUs per GPU for data loading:
ScalingConfig(
    num_workers=8,
    use_gpu=True,
    resources_per_worker={"CPU": 8, "GPU": 1}
)
Model Parallel Training: Large models split across multiple GPUs. Allocate multiple GPUs per worker:
ScalingConfig(
    num_workers=2,  # 2 model replicas
    use_gpu=True,
    resources_per_worker={"CPU": 16, "GPU": 4}  # Each replica uses 4 GPUs
)
Pipeline Parallel Training: Model stages run on different GPUs. ScalingConfig accepts a placement strategy string (Ray builds the underlying placement group, one bundle per worker); for manual control over colocation, Ray Core exposes placement groups directly:

from ray.util.placement_group import placement_group

# ScalingConfig takes a strategy string, not a placement group object;
# Ray creates the placement group internally
ScalingConfig(
    num_workers=8,
    use_gpu=True,
    placement_strategy="PACK"  # colocate workers where possible
)

# For Ray Core tasks and actors, placement groups can be created manually
pg = placement_group([{"GPU": 4, "CPU": 16}] * 2)  # 2 bundles of 4 GPUs each
ray.get(pg.ready())  # block until all bundles are reserved
Ray’s resource management ensures these allocations are satisfied before starting training, preventing partial allocations that would cause failures.
Hyperparameter Tuning at Scale
Hyperparameter optimization often requires running dozens or hundreds of training trials. Ray Tune provides efficient orchestration for this workload pattern.
Ray Tune Architecture
Ray Tune schedules multiple training trials across your Ray cluster, dynamically allocating resources and implementing sophisticated algorithms like population-based training or Bayesian optimization. On Kubernetes, this means trials automatically scale across available nodes, utilizing GPUs efficiently.
Implementing Scalable Hyperparameter Search
import ray
import torch
from ray import train, tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch  # requires the optuna package

def trainable(config):
    """Single trial training function"""
    # create_model, train_epoch, and validate are user-supplied helpers
    model = create_model(
        hidden_size=config["hidden_size"],
        num_layers=config["num_layers"],
        dropout=config["dropout"]
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])

    # Training loop with metrics reporting
    for epoch in range(10):
        train_loss = train_epoch(model, optimizer)
        val_loss = validate(model)
        # Report intermediate results for early stopping
        train.report({"val_loss": val_loss, "train_loss": train_loss})

# Configure search space
search_space = {
    "lr": tune.loguniform(1e-5, 1e-1),
    "hidden_size": tune.choice([128, 256, 512, 1024]),
    "num_layers": tune.randint(2, 6),
    "dropout": tune.uniform(0.1, 0.5)
}

# ASHA scheduler for early stopping of poor trials
scheduler = ASHAScheduler(
    metric="val_loss",
    mode="min",
    max_t=10,            # Max epochs per trial
    grace_period=1,      # Min epochs before stopping
    reduction_factor=2
)

# Optuna's TPE sampler gives Bayesian-style search over this mixed
# discrete/continuous space (BayesOptSearch supports only continuous params)
search_alg = OptunaSearch(
    metric="val_loss",
    mode="min"
)

# Run tuning across Ray cluster
ray.init(address="ray://ml-training-cluster-head:10001")
analysis = tune.run(
    trainable,
    config=search_space,
    num_samples=100,  # 100 trials
    scheduler=scheduler,
    search_alg=search_alg,
    resources_per_trial={"cpu": 4, "gpu": 1},
    verbose=1
)

# Get best configuration
best_config = analysis.get_best_config(metric="val_loss", mode="min")
print(f"Best hyperparameters: {best_config}")
This search runs 100 trials across your Kubernetes Ray cluster. If you have 10 GPUs available, Ray Tune runs 10 trials concurrently. The ASHA scheduler stops unpromising trials early, freeing resources for new trials. Bayesian optimization focuses search on promising regions of hyperparameter space.
The key advantage on Kubernetes: as your tuning job requests more resources (many concurrent trials), the Ray autoscaler can add worker pods, scaling your cluster to match workload demands. When tuning completes, workers scale down automatically.
Best Practices for Ray + Kubernetes
- Resource Requests = Limits: Set equal values to ensure predictable scheduling and avoid node oversubscription
- Persistent Storage: Use PersistentVolumes for checkpoints and logs to survive pod restarts
- Health Checks: Configure liveness and readiness probes for reliable failure detection
- Network Policies: Isolate Ray clusters with Kubernetes network policies for security
- Resource Quotas: Prevent runaway autoscaling with namespace resource quotas
- Monitoring: Export Ray metrics to Prometheus for visibility into cluster utilization
Serving ML Models with Ray Serve on Kubernetes
Once models are trained, serving them at scale requires orchestration that handles traffic routing, autoscaling, and version management. Ray Serve on Kubernetes provides this capability.
Ray Serve Architecture
Ray Serve is a model serving framework that runs on top of Ray. It provides:
- HTTP/gRPC endpoints for model inference
- Autoscaling based on request load
- Model composition (chaining multiple models)
- A/B testing and canary deployments
- Batching for throughput optimization (a sketch follows below)
On Kubernetes, Ray Serve deployments are Ray clusters configured for serving workloads rather than training.
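The batching feature deserves a concrete look. A sketch using Ray Serve's @serve.batch decorator (load_model is a user-supplied helper, as elsewhere in this article):

from ray import serve

@serve.deployment
class BatchedModel:
    def __init__(self):
        self.model = load_model("model.pt")  # user-supplied loader

    @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.05)
    async def predict_batch(self, inputs):
        # Ray Serve gathers up to 32 concurrent requests into one list so
        # the model can run a single batched forward pass; the return value
        # must be a list with one result per input
        return list(self.model(inputs))

    async def __call__(self, request):
        data = await request.json()
        return await self.predict_batch(data)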
Deploying Models with Ray Serve
import ray
from ray import serve

# Connect to Ray cluster
ray.init(address="ray://serving-cluster-head:10001")

# Start Ray Serve
serve.start(http_options={"host": "0.0.0.0", "port": 8000})

@serve.deployment(
    num_replicas=2,
    ray_actor_options={"num_cpus": 2, "num_gpus": 0.5}
)
class ModelServing:
    def __init__(self):
        # Load model once per replica (load_model and preprocess are
        # user-supplied helpers)
        self.model = load_model("s3://models/production-model.pt")

    async def __call__(self, request):
        data = await request.json()
        inputs = preprocess(data)
        predictions = self.model(inputs)
        return {"predictions": predictions.tolist()}

# Deploy (serve.run replaces the older ModelServing.deploy() call,
# which was removed in the Ray 2.x Serve API)
serve.run(ModelServing.bind())
This deployment creates 2 replicas of your model across the Ray cluster. Each replica uses 0.5 GPUs (GPU sharing for cost efficiency). Ray Serve automatically load balances requests across replicas and scales replicas based on load.
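Querying the endpoint is plain HTTP. A sketch, assuming port 8000 on the Serve proxy is reachable (for example via a Kubernetes Service or kubectl port-forward); the payload shape is illustrative:

import requests

# Reach the Serve HTTP proxy configured above, e.g. after:
#   kubectl port-forward <serving-head-pod> 8000:8000
resp = requests.post(
    "http://localhost:8000/",
    json={"features": [[0.1, 0.4, 0.3]]},  # illustrative payload
)
print(resp.json())  # {"predictions": [...]}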
Autoscaling Serving Deployments
Configure autoscaling to handle variable traffic:
@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        "target_num_ongoing_requests_per_replica": 5
    },
    ray_actor_options={"num_cpus": 2, "num_gpus": 0.5}
)
class ModelServing:
    ...  # implementation as above
When the average number of ongoing requests per replica exceeds 5, Ray Serve adds replicas. When requests drop, replicas scale down. Combined with Kubernetes autoscaling of Ray worker nodes, this provides end-to-end elasticity from HTTP requests to infrastructure.
Multi-Model Serving and Composition
Serve multiple models or compose them into pipelines:
import asyncio
from ray import serve

@serve.deployment
class Preprocessor:
    def __call__(self, data):
        return preprocess(data)  # user-supplied helper

@serve.deployment(ray_actor_options={"num_gpus": 1})
class ModelA:
    def __init__(self):
        self.model = load_model("model_a.pt")

    def __call__(self, data):
        return self.model(data)

@serve.deployment(ray_actor_options={"num_gpus": 1})
class ModelB:
    def __init__(self):
        self.model = load_model("model_b.pt")

    def __call__(self, data):
        return self.model(data)

@serve.deployment
class Ensemble:
    def __init__(self, preprocessor, model_a, model_b):
        # Bound deployments arrive as handles for remote calls
        self.preprocessor = preprocessor
        self.model_a = model_a
        self.model_b = model_b

    async def __call__(self, request):
        data = await request.json()
        # Preprocess once
        processed = await self.preprocessor.remote(data)
        # Run models in parallel
        result_a, result_b = await asyncio.gather(
            self.model_a.remote(processed),
            self.model_b.remote(processed)
        )
        # Combine predictions (ensemble_predictions is user-supplied)
        final = ensemble_predictions(result_a, result_b)
        return {"prediction": final}

# Deploy pipeline
preprocessor = Preprocessor.bind()
model_a = ModelA.bind()
model_b = ModelB.bind()
ensemble = Ensemble.bind(preprocessor, model_a, model_b)
serve.run(ensemble)
This pipeline preprocesses once, runs two models in parallel, and ensembles results—all with automatic load balancing and autoscaling.
Job Submission and Workflow Orchestration
Ray Jobs API provides a way to submit workloads to Ray clusters on Kubernetes, enabling integration with CI/CD pipelines and workflow orchestration tools.
Submitting Jobs to Kubernetes-Hosted Ray Clusters
From outside the cluster, submit jobs via Ray Jobs API:
from ray.job_submission import JobSubmissionClient

# Connect to the Ray cluster via the Kubernetes service for the head pod
client = JobSubmissionClient("http://ml-training-cluster-head:8265")

# Submit training job
job_id = client.submit_job(
    entrypoint="python train.py --epochs 100 --lr 0.001",
    runtime_env={
        "working_dir": "./",
        "pip": ["torch==2.0.0", "transformers==4.30.0"]
    }
)
print(f"Submitted job {job_id}")

# Monitor job status
status = client.get_job_status(job_id)
logs = client.get_job_logs(job_id)
This pattern enables scheduled training (via Kubernetes CronJobs), CI/CD integration (triggering training from GitHub Actions), and workflow orchestration (using Airflow or Argo to coordinate multi-stage ML pipelines).
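In a CI pipeline, the submitting step usually blocks until the job reaches a terminal state. A minimal sketch:

import time
from ray.job_submission import JobSubmissionClient, JobStatus

client = JobSubmissionClient("http://ml-training-cluster-head:8265")
job_id = client.submit_job(entrypoint="python train.py --epochs 100")

# Poll until the job reaches a terminal state
status = client.get_job_status(job_id)
while status not in {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}:
    time.sleep(10)
    status = client.get_job_status(job_id)

print(client.get_job_logs(job_id))
assert status == JobStatus.SUCCEEDED, f"Job ended with status {status}"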
Workflow Patterns
Common workflow patterns on Ray + Kubernetes:
Nightly Model Retraining: CronJob triggers Ray job to retrain model on latest data, validate, and deploy if metrics improve
Feature Engineering Pipeline: Argo Workflow orchestrates Ray jobs for data extraction, feature computation, and feature store updates
Experiment Tracking: CI pipeline submits Ray Tune job for hyperparameter search, logs results to MLflow, and notifies team of best configuration
A/B Test Deployment: GitOps workflow updates Ray Serve deployment to route 10% traffic to new model version, monitors metrics, and promotes or rolls back
These patterns leverage Kubernetes for workflow orchestration while Ray handles the distributed ML computation.
Monitoring and Observability
Production AI platforms require comprehensive monitoring to ensure reliability and efficiency.
Ray Dashboard
Ray provides a built-in dashboard showing:
- Cluster resource utilization (CPU, GPU, memory)
- Active tasks and their resource consumption
- Worker node status and health
- Job submission history and logs
Expose the dashboard via Kubernetes ingress:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ray-dashboard
  namespace: ml-platform
spec:
  rules:
    - host: ray-dashboard.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ml-training-cluster-head
                port:
                  number: 8265
Prometheus Integration
Export Ray metrics to Prometheus for long-term storage and alerting:
# Expose Ray's Prometheus metrics endpoint via rayStartParams in the
# RayCluster spec (head and worker groups)
rayStartParams:
  metrics-export-port: "8080"
# ServiceMonitor for Prometheus scraping (assumes the Ray service carries
# the app: ray-cluster label and names its metrics port "metrics")
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ray-metrics
  namespace: ml-platform
spec:
  selector:
    matchLabels:
      app: ray-cluster
  endpoints:
    - port: metrics
      interval: 30s
Key metrics to monitor:
- GPU Utilization: Ensure expensive GPU resources are fully utilized
- Task Queue Depth: High queues indicate resource constraints
- Worker Failures: Frequent failures suggest infrastructure issues
- Job Duration: Track training time trends for performance regression
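For quick utilization checks outside Prometheus, Ray also exposes aggregate resource state programmatically:

import ray

ray.init(address="ray://ml-training-cluster-head:10001")

# Compare total versus currently free resources to spot idle GPUs
total = ray.cluster_resources()
free = ray.available_resources()
gpu_total = total.get("GPU", 0)
gpu_free = free.get("GPU", 0)
print(f"GPUs in use: {gpu_total - gpu_free:g} of {gpu_total:g}")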
Logging Strategy
Aggregate logs from Ray pods using Kubernetes logging infrastructure:
# FluentBit DaemonSet collects logs
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  filter.conf: |
    [FILTER]
        Name             kubernetes
        Match            kube.*
        Kube_Tag_Prefix  kube.var.log.containers.
    [FILTER]
        Name             grep
        Match            kube.*
        Regex            $kubernetes['labels']['app'] ray-cluster
Forward logs to Elasticsearch, CloudWatch, or your logging platform for analysis and debugging.
Conclusion
Orchestrating AI workloads using Ray and Kubernetes combines the distributed computing capabilities purpose-built for machine learning with the infrastructure management maturity of the cloud-native ecosystem. Ray provides the application-level intelligence to schedule training across GPUs, optimize hyperparameter searches, and serve models efficiently, while Kubernetes handles the infrastructure concerns of container lifecycle, node management, and elastic scaling. This architectural pattern enables ML teams to build production AI platforms that scale from experimentation to hyperscale deployment, handling distributed training, hyperparameter tuning, and model serving through a unified framework with consistent APIs and operational practices.
The patterns covered—heterogeneous worker groups for different workload types, Ray Train for distributed training with automatic gradient synchronization, Ray Tune for scalable hyperparameter optimization, Ray Serve for elastic model serving, and job submission for workflow integration—provide a comprehensive toolkit for modern ML engineering. As AI workloads continue growing in scale and complexity, the Ray and Kubernetes combination will remain essential infrastructure, enabling organizations to efficiently utilize expensive GPU resources, accelerate model development through parallelization, and deliver AI applications that scale reliably. Mastering this orchestration stack is increasingly fundamental to building competitive AI products.