ML Journey

Feature Store Design Patterns for Small Data Teams

December 6, 2025 by Peter Song

Feature stores have emerged as critical infrastructure in production machine learning, promising to solve the twin challenges of training-serving skew and feature reusability across projects. Yet the canonical implementations (Feast, Tecton, or custom systems built at Uber and Airbnb) assume resources that small data teams simply don’t have: dedicated MLOps engineers, managed Kubernetes clusters, real-time streaming infrastructure, …

How to Tune CatBoost Models for Structured E-commerce Data

December 6, 2025 by Peter Song

CatBoost has emerged as the gradient boosting algorithm of choice for e-commerce practitioners working with structured data, and for good reason. Its native handling of categorical features eliminates the preprocessing headaches that plague other algorithms, its ordered boosting reduces overfitting on the noisy conversion signals typical in retail, and its GPU acceleration makes it practical …

Nonlinear Dimensionality Reduction for High-Noise Datasets

December 6, 2025 by Peter Song

High-dimensional data presents a fundamental challenge in machine learning and data science: when datasets contain hundreds or thousands of features, visualization becomes impossible, computation becomes expensive, and the curse of dimensionality causes many algorithms to fail. Dimensionality reduction techniques offer a solution by projecting data into lower dimensions while preserving important structure. However, when your …

Gradient Noise Scale and Batch Size Relationship

December 6, 2025 by Peter Song

When training neural networks, practitioners face a fundamental question that significantly impacts both model quality and training efficiency: what batch size should I use? The answer isn’t simply “as large as your GPU memory allows” or “stick with the default.” The relationship between batch size and gradient noise scale reveals deep insights into the optimization …

How to Use AWS Data Pipeline for Machine Learning

December 6, 2025 by Peter Song

Machine learning workflows are inherently data-intensive, requiring orchestration of complex sequences: data extraction from multiple sources, transformation and cleaning, feature engineering, model training, validation, and deployment. Managing these workflows manually quickly becomes unsustainable as complexity grows. AWS Data Pipeline, a web service for orchestrating and automating data movement and transformation, provides infrastructure for building reliable, …

Real-Time Prediction Pipelines Using Kafka and Python

December 6, 2025 by Peter Song

The demand for real-time machine learning predictions has transformed from a competitive advantage into a business necessity. Whether detecting fraudulent transactions within milliseconds, personalizing content as users browse, or predicting equipment failures before they occur, organizations require prediction systems that process streaming data and deliver results in real time. Building these systems requires combining stream processing …

How to Validate Geo Holdout Experiments Using Synthetic Control Methods

December 6, 2025 by Peter Song

Geographic holdout experiments have become a cornerstone of marketing measurement, allowing companies to estimate the causal impact of advertising campaigns by comparing regions where ads run (treatment) against regions where they don’t (control). Unlike digital A/B tests, where individual users can be randomly assigned to treatment and control, geo experiments deal with entire markets: cities, DMAs, …

Naive Bayes Variants: Gaussian vs Multinomial vs Bernoulli

December 6, 2025 by Peter Song

Naive Bayes classifiers are among the most elegant algorithms in machine learning: simple in concept, fast in execution, and surprisingly effective across diverse applications. The “naive” assumption that features are conditionally independent given the class label seems unrealistic, yet in practice, Naive Bayes often performs competitively with far more complex models. However, not all Naive Bayes …

Machine Learning Models for Forecasting Subscription Revenue in Ecommerce

December 6, 2025 by Peter Song

Subscription-based ecommerce businesses live and die by their ability to accurately forecast revenue. Unlike traditional ecommerce, where transactions are discrete, subscription models create complex, interdependent patterns involving new customer acquisition, retention rates, upgrade behavior, seasonal churn, and reactivation, all of which must be predicted simultaneously to generate reliable revenue forecasts. Traditional forecasting methods struggle with this …

Fun Data Visualisation Ideas Using Free Datasets

December 6, 2025 by Peter Song

Data visualisation doesn’t have to be dry corporate dashboards and quarterly sales reports. Some of the most engaging, creative, and educational visualisations come from exploring quirky datasets about topics people actually care about: pop culture, sports, food, travel, and the countless fascinating patterns hidden in everyday life. The internet is overflowing with free, high-quality datasets just …