Naive Bayes Variants: Gaussian vs Multinomial vs Bernoulli

Naive Bayes classifiers are among the most elegant algorithms in machine learning—simple in concept, fast in execution, and surprisingly effective across diverse applications. The “naive” assumption that features are conditionally independent given the class label seems unrealistic, yet in practice, Naive Bayes often performs competitively with far more complex models. However, not all Naive Bayes …
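The three variants in the title differ mainly in how they model each feature's class-conditional distribution. A minimal sketch using scikit-learn's `GaussianNB`, `MultinomialNB`, and `BernoulliNB` on a made-up toy dataset (the data and the test point are illustrative, not from the article):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Toy data: 6 samples, 3 count-like features, two classes.
# Class 0 leans on feature 0; class 1 leans on features 1 and 2.
X_counts = np.array([[3, 0, 1], [2, 0, 0], [4, 1, 0],
                     [0, 2, 3], [1, 3, 2], [0, 4, 4]])
y = np.array([0, 0, 0, 1, 1, 1])

# MultinomialNB models feature counts (e.g. word frequencies).
print(MultinomialNB().fit(X_counts, y).predict([[3, 0, 0]]))

# BernoulliNB models binary presence/absence; it binarizes input by default.
print(BernoulliNB().fit(X_counts, y).predict([[3, 0, 0]]))

# GaussianNB models continuous features with one normal distribution
# per class and feature; jitter the counts to make them continuous.
X_cont = X_counts + np.random.default_rng(0).normal(0, 0.1, X_counts.shape)
print(GaussianNB().fit(X_cont, y).predict([[3.0, 0.0, 0.0]]))
```

All three agree on this easy test point; the variants diverge when the feature type (counts, binary indicators, continuous values) does not match the assumed distribution.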

Machine Learning Models for Forecasting Subscription Revenue in Ecommerce

Subscription-based ecommerce businesses live and die by their ability to accurately forecast revenue. Unlike traditional ecommerce where transactions are discrete, subscription models create complex, interdependent patterns involving new customer acquisition, retention rates, upgrade behavior, seasonal churn, and reactivation—all of which must be predicted simultaneously to generate reliable revenue forecasts. Traditional forecasting methods struggle with this …
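The interacting components the excerpt lists (acquisition, churn, reactivation) can be seen in even the simplest roll-forward projection. A toy sketch with made-up rates, not a model from the article:

```python
# Toy monthly subscription-revenue projection under assumed, illustrative
# rates: each month some subscribers churn, new ones join, a few reactivate.
def project_revenue(subscribers, months, *, new_per_month=120,
                    churn_rate=0.05, reactivation=10, arpu=30.0):
    """Roll the subscriber count forward and return monthly revenue."""
    revenue = []
    for _ in range(months):
        subscribers = subscribers * (1 - churn_rate) + new_per_month + reactivation
        revenue.append(round(subscribers * arpu, 2))
    return revenue

print(project_revenue(1000, 3))
```

Real forecasting models replace these fixed rates with learned, time-varying ones, which is exactly where the interdependence becomes hard.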

Fun Data Visualisation Ideas Using Free Datasets

Data visualisation doesn’t have to be dry corporate dashboards and quarterly sales reports. Some of the most engaging, creative, and educational visualisations come from exploring quirky datasets about topics people actually care about—pop culture, sports, food, travel, and the countless fascinating patterns hidden in everyday life. The internet is overflowing with free, high-quality datasets just …

Real World Examples of LLMs in Healthcare and Life Sciences

Large Language Models are no longer confined to writing emails and generating code. In healthcare and life sciences, LLMs are being deployed in production systems that directly impact patient care, accelerate drug discovery, and transform how medical knowledge is accessed and applied. These aren’t experimental projects or proof-of-concepts—they’re operational systems processing millions of medical interactions, …

How LLMs Are Transforming Customer Support Automation

Customer support has always been a challenging balance between efficiency and quality. Companies need to respond quickly to thousands of inquiries while maintaining the personalized, empathetic service that builds customer loyalty. For decades, this meant choosing between expensive human agents who provide excellent service but don’t scale, or rigid automated systems that scale well but …

What is NLP vs ML vs DL: Differences and Relationships

If you’re exploring artificial intelligence, you’ve likely encountered the terms Machine Learning (ML), Deep Learning (DL), and Natural Language Processing (NLP). These acronyms are everywhere in tech discussions, research papers, and job descriptions. While they’re often used interchangeably in casual conversation, they represent distinct concepts with specific relationships to each other. Understanding these differences isn’t …

Connecting AWS Glue and SageMaker for ML Pipelines

Machine learning pipelines in production require more than just model training. The reality is that data scientists spend roughly 80% of their time on data preparation, transformation, and feature engineering before they can even begin training models. This is where the combination of AWS Glue and Amazon SageMaker becomes transformative. While SageMaker excels at machine …

Monitoring Debezium Connectors for CDC Pipelines

Change Data Capture (CDC) has become the backbone of modern data architectures, enabling real-time data synchronization between operational databases and analytical systems, powering event-driven architectures, and maintaining materialized views across distributed systems. Debezium, as the leading open-source CDC platform, captures row-level changes from databases and streams them to Kafka with minimal latency and at-least-once delivery guarantees. …
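Since Debezium connectors run on Kafka Connect, the standard starting point for monitoring them is Kafka Connect's REST API (`GET /connectors/{name}/status`). A minimal sketch, assuming a Connect worker on the default port; the connector name and helper are hypothetical:

```python
import json
from urllib.request import urlopen

CONNECT_URL = "http://localhost:8083"  # assumed Kafka Connect REST endpoint

def unhealthy_tasks(status):
    """Return (task_id, state) pairs for tasks not in the RUNNING state."""
    return [(t["id"], t["state"])
            for t in status.get("tasks", [])
            if t["state"] != "RUNNING"]

def check_connector(name):
    # GET /connectors/{name}/status is the standard Kafka Connect REST call;
    # it reports the connector state plus one entry per task.
    with urlopen(f"{CONNECT_URL}/connectors/{name}/status") as resp:
        return unhealthy_tasks(json.load(resp))

# Example payload shape returned by the status endpoint:
sample = {"name": "inventory-connector",
          "connector": {"state": "RUNNING", "worker_id": "10.0.0.1:8083"},
          "tasks": [{"id": 0, "state": "RUNNING"},
                    {"id": 1, "state": "FAILED"}]}
print(unhealthy_tasks(sample))
```

A `FAILED` task here is the signal to inspect the worker logs and, after fixing the cause, hit the task restart endpoint.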

Transformer vs RNN Performance for Sequence Modelling

The rise of transformers has fundamentally reshaped how we approach sequence modeling in deep learning. For years, recurrent neural networks—LSTMs and GRUs—dominated tasks involving sequential data like language translation, time series prediction, and speech recognition. Then in 2017, the “Attention is All You Need” paper introduced transformers, claiming better performance with greater parallelization. Today, transformers …
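The parallelization claim comes down to a structural difference: a recurrent network must loop over time steps because each hidden state depends on the previous one, while attention computes all pairwise interactions in one matrix product. A minimal NumPy sketch of the two computation patterns (toy weights, no training):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4                      # sequence length, model width
X = rng.normal(size=(T, d))      # toy input sequence

# RNN-style processing: an inherently sequential loop over time steps.
Wh, Wx = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h = np.zeros(d)
for t in range(T):               # step t depends on h from step t - 1
    h = np.tanh(h @ Wh + X[t] @ Wx)

# Attention-style processing: all positions interact in one matmul,
# so every output position can be computed in parallel.
scores = X @ X.T / np.sqrt(d)    # (T, T) similarity scores
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
context = weights @ X            # (T, d) attended outputs, no time loop
```

The loop is why RNN wall-clock time grows with sequence length even on a large GPU, and the single matmul is why transformers saturate the hardware instead.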

Speculative Decoding for Faster LLM Token Generation

Large language models generate text one token at a time in an autoregressive fashion—each token depends on all previous tokens, creating a sequential bottleneck that prevents parallelization. This sequential nature is fundamental to how transformers work, yet it creates a frustrating limitation: no matter how powerful your GPU is, you’re stuck generating tokens one at …
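The core speculative-decoding loop—a cheap draft model proposes several tokens, the expensive target model verifies them in one pass—can be sketched without any real LLM. Both "models" below are stand-in deterministic functions over integer tokens, and the greedy accept-longest-matching-prefix rule is a simplification of the probabilistic acceptance test used in practice:

```python
# Toy sketch of greedy speculative decoding with stand-in "models".

def target_next(ctx):
    # Hypothetical slow, accurate model: next token = (sum of context) % 10.
    return sum(ctx) % 10

def draft_next(ctx):
    # Hypothetical fast draft model: agrees with the target except when the
    # last token is 7, where it guesses wrong.
    return (sum(ctx) + 1) % 10 if ctx[-1] == 7 else sum(ctx) % 10

def speculative_generate(prompt, n_tokens, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1) Draft model cheaply proposes k tokens autoregressively.
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Target model checks the k positions (in one parallel forward
        #    pass for a real transformer); keep the longest agreeing prefix.
        accepted, ctx = [], list(out)
        for t in draft:
            if target_next(ctx) != t:
                break
            accepted.append(t)
            ctx.append(t)
        # 3) The target always contributes one token after the last accept,
        #    so even a total miss still makes progress.
        accepted.append(target_next(ctx))
        out.extend(accepted)
    return out[len(prompt):][:n_tokens]

print(speculative_generate([1, 2], 8))
```

The key property the sketch preserves is that the output is identical to decoding with the target model alone; the speedup comes from accepting several draft tokens per expensive verification step.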