Optimising Spark Jobs: Common Pitfalls and Quick Wins

Apache Spark has become the de facto standard for large-scale data processing, powering everything from ETL pipelines to machine learning workflows. Yet despite its reputation for speed and scalability, poorly optimised Spark jobs can crawl along at a fraction of their potential performance, burning through compute resources while data engineers watch progress bars inch forward. …

ML Ranking Models for Personalised Product Recommendations

In the fiercely competitive landscape of e-commerce, the difference between a user who converts and one who bounces often comes down to a single moment: what products appear in their feed. Machine learning ranking models have evolved from simple collaborative filtering algorithms into sophisticated systems that orchestrate complex signals—user behavior, product attributes, contextual factors, and …

Time-Aware Negative Sampling Strategies for Recommendation Models

In the realm of recommendation systems, the quality of training data fundamentally determines model performance. While positive interactions—items users have clicked, purchased, or enjoyed—are straightforward to collect, negative samples represent a more nuanced challenge. Traditional negative sampling approaches often treat all non-interacted items equally, ignoring a critical dimension: time. Time-aware negative sampling strategies have emerged …

How to Handle Missing Data in Pandas

Missing data is one of the most common and frustrating challenges in data analysis. Whether it’s sensor failures, survey non-responses, data entry errors, or simply information that was never collected, gaps in your dataset can undermine analysis, break machine learning models, and lead to incorrect conclusions. Pandas, Python’s premier data manipulation library, provides a rich …

How to Preprocess Categorical Data in Python

Categorical data—variables representing discrete categories like product types, customer segments, or geographic regions—permeates real-world datasets, yet most machine learning algorithms expect numerical inputs, creating a fundamental preprocessing challenge. Unlike numerical features where values naturally exist on a scale, categorical variables encode qualitative distinctions that require thoughtful transformation into numerical representations that preserve semantic meaning while …

Kernel PCA vs Linear PCA: Strengths and Limits

Principal Component Analysis (PCA) is one of the most widely used dimensionality reduction techniques in machine learning and data analysis. Its ability to compress high-dimensional data into fewer dimensions while retaining maximum variance makes it invaluable for visualization, noise reduction, and preprocessing. However, standard linear PCA has a fundamental limitation: it can only capture linear …

Optimizing Parquet Schemas for ML Training Performance

Machine learning training on large datasets has become the bottleneck in modern ML workflows. While practitioners obsess over model architecture and hyperparameters, they often overlook a fundamental performance constraint: how quickly training data can be read from disk and fed into GPUs or CPUs. When training models on terabytes of data stored in Parquet files, …

L1 vs L2 Regularization Impact on Sparse Feature Models

Regularization is a cornerstone of machine learning model training, preventing overfitting by penalizing model complexity. While most practitioners understand that L1 and L2 regularization serve this goal, the profound differences in how they shape model behavior—especially with sparse feature sets—are often underappreciated. These differences aren’t subtle theoretical curiosities but practical distinctions that determine whether your …

Best Practices for AWS DMS Monitoring and Logging

AWS Database Migration Service (DMS) has become the go-to solution for migrating databases to AWS, enabling everything from simple lift-and-shift moves to complex heterogeneous migrations and ongoing replication for hybrid architectures. Yet the power of DMS comes with operational complexity—replication tasks can lag, fail silently during full loads, encounter data type conversion errors, or experience network …

Best Learning Rate Schedules for Training Deep Neural Networks from Scratch

The learning rate stands as the single most influential hyperparameter in training deep neural networks, yet maintaining a fixed learning rate throughout training represents a fundamentally suboptimal strategy. When training from scratch—without transfer learning or pretrained weights—the optimization landscape changes dramatically as training progresses: early epochs require aggressive exploration with large learning rates to escape …