Orchestrating Machine Learning Training Jobs with Airflow and Kubernetes

When you’re moving machine learning models from experimental Jupyter notebooks to production-grade training pipelines, you need robust orchestration that handles complexity, scales with your computational needs, and provides visibility into every step of the process. Apache Airflow combined with Kubernetes offers a powerful solution for orchestrating ML training jobs—Airflow provides workflow management and scheduling, while … Read more
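As a flavor of what that pairing looks like in practice, here is a minimal sketch of an Airflow DAG that launches a training container on Kubernetes via the KubernetesPodOperator. The image name, namespace, and script arguments are hypothetical placeholders, and the operator's import path varies with the cncf.kubernetes provider version.

```python
from datetime import datetime

from airflow import DAG
# Import path as of recent cncf.kubernetes provider releases; older
# versions expose it under operators.kubernetes_pod instead.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="ml_training_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Each run launches a fresh pod from a (hypothetical) training image,
    # so compute is requested per job rather than held by Airflow itself.
    train_model = KubernetesPodOperator(
        task_id="train_model",
        name="train-model",
        namespace="ml-jobs",
        image="registry.example.com/ml/train:latest",
        cmds=["python", "train.py"],
        arguments=["--epochs", "10"],
        get_logs=True,
    )
```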

Optimizing Spark Jobs: Common Pitfalls and Quick Wins

Apache Spark has become the de facto standard for large-scale data processing, powering everything from ETL pipelines to machine learning workflows. Yet despite its reputation for speed and scalability, poorly optimized Spark jobs can crawl along at a fraction of their potential performance, burning through compute resources while data engineers watch progress bars inch forward. … Read more
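As a taste of the quick wins involved, the sketch below shows two common PySpark fixes: broadcasting a small dimension table to avoid a full shuffle join, and right-sizing shuffle partitions. Paths, table names, and columns are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("quick-wins").getOrCreate()

# Quick win 1: tune shuffle partitions to match data volume
# (the default of 200 is often wrong for very small or very large jobs).
spark.conf.set("spark.sql.shuffle.partitions", "400")

# Hypothetical inputs: a large fact table joined to a small dimension table.
events = spark.read.parquet("s3://bucket/events/")
users = spark.read.parquet("s3://bucket/users/")

# Quick win 2: broadcast the small side so the join avoids shuffling
# the large table across the cluster.
joined = events.join(broadcast(users), "user_id")

# Quick win 3: partition output explicitly to keep file sizes sane.
joined.repartition("event_date").write.mode("overwrite") \
    .partitionBy("event_date").parquet("s3://bucket/joined/")
```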

Optimizing Parquet Schemas for ML Training Performance

Data throughput has become a major bottleneck when training machine learning models on large datasets. While practitioners obsess over model architecture and hyperparameters, they often overlook a fundamental performance constraint: how quickly training data can be read from disk and fed into GPUs or CPUs. When training models on terabytes of data stored in Parquet files, … Read more
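A minimal PyArrow sketch of the idea: because Parquet is columnar, projecting only the feature columns avoids reading the rest of the file from disk, and row-group size controls the granularity at which data loaders can parallelize reads. File and column names here are hypothetical.

```python
import pyarrow.parquet as pq

# Column pruning is the main lever: Parquet is columnar, so reading
# only the needed columns skips everything else on disk.
table = pq.read_table(
    "features.parquet",
    columns=["user_id", "feature_a", "feature_b", "label"],
)

# When writing, row-group size sets the unit of parallel reads; smaller
# groups let data loaders fetch training batches independently.
pq.write_table(
    table,
    "features_tuned.parquet",
    row_group_size=100_000,
    compression="zstd",
)
```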

Best Practices for AWS DMS Monitoring and Logging

AWS Database Migration Service (DMS) has become the go-to solution for migrating databases to AWS, enabling everything from simple lift-and-shift migrations to complex heterogeneous migrations and ongoing replication for hybrid architectures. Yet the power of DMS comes with operational complexity—replication tasks can lag, fail silently during full loads, encounter data type conversion errors, or experience network … Read more
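A small boto3 sketch of the monitoring loop this implies: polling task status and built-in statistics via describe_replication_tasks, then pulling CDC latency from CloudWatch. The replication instance and task identifiers are hypothetical.

```python
from datetime import datetime, timedelta

import boto3

dms = boto3.client("dms")
cloudwatch = boto3.client("cloudwatch")

# Task-level status and full-load progress come straight from the DMS API.
for task in dms.describe_replication_tasks()["ReplicationTasks"]:
    stats = task.get("ReplicationTaskStats", {})
    print(task["ReplicationTaskIdentifier"], task["Status"],
          "full load %:", stats.get("FullLoadProgressPercent"))

# CDC source latency lives in CloudWatch under the AWS/DMS namespace
# (identifiers below are hypothetical placeholders).
latency = cloudwatch.get_metric_statistics(
    Namespace="AWS/DMS",
    MetricName="CDCLatencySource",
    Dimensions=[
        {"Name": "ReplicationInstanceIdentifier", "Value": "my-replication-instance"},
        {"Name": "ReplicationTaskIdentifier", "Value": "my-task"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
```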

How to Use AWS Data Pipeline for Machine Learning

Machine learning workflows are inherently data-intensive, requiring orchestration of complex sequences: data extraction from multiple sources, transformation and cleaning, feature engineering, model training, validation, and deployment. Managing these workflows manually quickly becomes unsustainable as complexity grows. AWS Data Pipeline, a web service for orchestrating and automating data movement and transformation, provides infrastructure for building reliable, … Read more
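To make the orchestration concrete, here is a hedged boto3 sketch that creates a pipeline, pushes a one-activity definition (a shell command running on an EC2 resource), and activates it. Names, IAM roles, and the command are hypothetical placeholders.

```python
import boto3

dp = boto3.client("datapipeline")

# Create the pipeline shell, then push a definition and activate it.
pipeline_id = dp.create_pipeline(
    name="ml-feature-pipeline", uniqueId="ml-feature-pipeline-v1"
)["pipelineId"]

dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        # Default object: on-demand scheduling plus (hypothetical) IAM roles.
        {"id": "Default", "name": "Default", "fields": [
            {"key": "scheduleType", "stringValue": "ondemand"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
        ]},
        # A single activity: run a feature-engineering script on EC2.
        {"id": "PrepFeatures", "name": "PrepFeatures", "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "command", "stringValue": "python prepare_features.py"},
            {"key": "runsOn", "refValue": "TrainingInstance"},
        ]},
        {"id": "TrainingInstance", "name": "TrainingInstance", "fields": [
            {"key": "type", "stringValue": "Ec2Resource"},
            {"key": "instanceType", "stringValue": "m5.large"},
        ]},
    ],
)
dp.activate_pipeline(pipelineId=pipeline_id)
```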

Real-Time Prediction Pipelines Using Kafka and Python

The demand for real-time machine learning predictions has transformed from a competitive advantage into a business necessity. Whether detecting fraudulent transactions within milliseconds, personalizing content as users browse, or predicting equipment failures before they occur, organizations require prediction systems that process streaming data and deliver results in real time. Building these systems means combining stream processing … Read more
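The core loop is usually small. Below is a sketch using the kafka-python client: consume feature events, score them with a model, and publish predictions to another topic. Topic names, the broker address, and the placeholder score() function are all hypothetical.

```python
import json

from kafka import KafkaConsumer, KafkaProducer

# Consume feature events from a (hypothetical) input topic.
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    group_id="fraud-scorer",
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def score(features):
    # Placeholder for a real model call, e.g. model.predict_proba(...).
    return 0.5

# Score each event as it arrives and publish the result downstream.
for message in consumer:
    event = message.value
    prediction = {"id": event["id"], "fraud_score": score(event)}
    producer.send("predictions", prediction)
```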

Connecting AWS Glue and SageMaker for ML Pipelines

Machine learning pipelines in production require more than just model training. By many estimates, data scientists spend roughly 80% of their time on data preparation, transformation, and feature engineering before they can even begin training models. This is where the combination of AWS Glue and Amazon SageMaker becomes transformative. While SageMaker excels at machine … Read more
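One common wiring pattern, sketched with boto3: run the Glue ETL job that materializes features to S3, poll until it finishes, then launch a SageMaker training job on the output. Job names, ARNs, the image URI, and S3 paths are hypothetical placeholders.

```python
import time

import boto3

glue = boto3.client("glue")
sm = boto3.client("sagemaker")

# Step 1: kick off the (hypothetical) Glue ETL job and poll to completion.
run_id = glue.start_job_run(JobName="prepare-training-data")["JobRunId"]
while True:
    state = glue.get_job_run(
        JobName="prepare-training-data", RunId=run_id
    )["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED"):
        break
    time.sleep(30)

# Step 2: train on the Glue output (image URI and role are placeholders).
if state == "SUCCEEDED":
    sm.create_training_job(
        TrainingJobName="churn-model-001",
        AlgorithmSpecification={
            "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
            "TrainingInputMode": "File",
        },
        RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
        InputDataConfig=[{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/features/",
            }},
        }],
        OutputDataConfig={"S3OutputPath": "s3://my-bucket/models/"},
        ResourceConfig={
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        StoppingCondition={"MaxRuntimeInSeconds": 3600},
    )
```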

Monitoring Debezium Connectors for CDC Pipelines

Change Data Capture (CDC) has become the backbone of modern data architectures, enabling real-time data synchronization between operational databases and analytical systems, powering event-driven architectures, and maintaining materialized views across distributed systems. Debezium, as the leading open-source CDC platform, captures row-level changes from databases and streams them to Kafka with low latency and at-least-once delivery guarantees. … Read more
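A first line of defense is the Kafka Connect REST API, which exposes connector and task state. The sketch below polls a (hypothetical) connector's status and restarts any failed task in place; the Connect host is a placeholder, with the default port 8083.

```python
import requests

# Kafka Connect REST endpoint (hypothetical host, default port).
CONNECT_URL = "http://connect:8083"
CONNECTOR = "inventory-connector"  # hypothetical connector name

# /connectors/{name}/status reports connector and per-task state.
status = requests.get(f"{CONNECT_URL}/connectors/{CONNECTOR}/status").json()
print("connector:", status["connector"]["state"])

for task in status["tasks"]:
    print(f"task {task['id']}: {task['state']}")
    if task["state"] == "FAILED":
        # Restart just the failed task without bouncing the connector.
        requests.post(
            f"{CONNECT_URL}/connectors/{CONNECTOR}/tasks/{task['id']}/restart"
        )
```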

Deploying Debezium on AWS ECS or Fargate

Debezium’s change data capture capabilities transform databases into event streams, enabling real-time data pipelines, microservices synchronization, and event-driven architectures. While Kafka Connect provides the standard deployment model for Debezium connectors, running this infrastructure on AWS demands careful consideration of container orchestration options. ECS (Elastic Container Service) and Fargate offer distinct approaches to deploying Debezium—ECS provides … Read more
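As a sketch of the Fargate path, here is a boto3 call registering a task definition for a Kafka Connect worker running the stock debezium/connect image. The IAM role, Kafka bootstrap address, and sizing are hypothetical; the environment variables are the ones the Debezium container image reads for its Connect worker configuration.

```python
import boto3

ecs = boto3.client("ecs")

# Minimal Fargate task definition for a Debezium-equipped Connect worker
# (role ARN, broker address, and sizing are hypothetical placeholders).
ecs.register_task_definition(
    family="debezium-connect",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="1024",
    memory="2048",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    containerDefinitions=[{
        "name": "connect",
        "image": "debezium/connect:2.5",
        "essential": True,
        "portMappings": [{"containerPort": 8083, "protocol": "tcp"}],
        "environment": [
            {"name": "BOOTSTRAP_SERVERS", "value": "kafka:9092"},
            {"name": "GROUP_ID", "value": "debezium-cluster"},
            {"name": "CONFIG_STORAGE_TOPIC", "value": "connect-configs"},
            {"name": "OFFSET_STORAGE_TOPIC", "value": "connect-offsets"},
            {"name": "STATUS_STORAGE_TOPIC", "value": "connect-status"},
        ],
    }],
)
```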

Integrating CockroachDB with Airflow and dbt

Modern data engineering workflows demand robust orchestration, reliable transformations, and databases that can scale with growing data volumes. Integrating CockroachDB with Apache Airflow and dbt (data build tool) creates a powerful stack for building production-grade data pipelines that combine the best of distributed databases, workflow orchestration, and analytics engineering. This integration enables data teams to … Read more
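Because CockroachDB speaks the PostgreSQL wire protocol, Airflow's Postgres provider can target it directly, with dbt running as a downstream task. A minimal sketch, assuming a hypothetical connection id, table, and dbt project path:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="cockroach_dbt_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Load step runs SQL against CockroachDB through a Postgres-protocol
    # connection (hypothetical connection id and table).
    load_raw = PostgresOperator(
        task_id="load_raw",
        postgres_conn_id="cockroachdb_default",
        sql="INSERT INTO raw.run_log (run_date) VALUES (CURRENT_DATE)",
    )

    # Transform step invokes dbt against the same database
    # (hypothetical project path).
    run_dbt = BashOperator(
        task_id="run_dbt",
        bash_command="cd /opt/dbt/analytics && dbt run",
    )

    load_raw >> run_dbt
```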