Data Science Archives

Statistical vs Machine Learning Time-Series Forecasting Models

January 9, 2026 by Peter Song

Time-series forecasting stands as one of the most critical challenges in data science, impacting everything from stock market predictions to supply chain management. As organizations increasingly rely on accurate predictions to drive decision-making, the debate between statistical and machine learning approaches has intensified. Understanding the fundamental differences, strengths, and limitations of these methodologies is essential …

How to Choose Epsilon in DBSCAN

January 1, 2026 by Peter Song

When you’re working with density-based clustering using DBSCAN, the most critical—and often most frustrating—challenge is selecting the right epsilon (ε) parameter. This single value determines the radius around each point that defines its neighborhood, fundamentally shaping whether your clustering succeeds or fails. Choose epsilon too small, and you’ll fragment natural clusters into meaningless pieces. Choose …

KL Divergence Explained: Information Theory’s Most Important Metric

December 31, 2025 by Peter Song

When you’re working with probability distributions in machine learning, statistics, or information theory, you’ll inevitably encounter KL divergence. This mathematical concept might seem intimidating at first, but it’s one of the most fundamental tools for comparing distributions and understanding how information flows in systems. Whether you’re training neural networks, analyzing data, or optimizing models, grasping …

Cosine Similarity vs Dot Product vs Euclidean Distance

December 30, 2025 by Peter Song

Vector similarity metrics form the backbone of modern machine learning systems, from recommendation engines that suggest your next favorite movie to search engines that retrieve relevant documents from billions of candidates. Yet the choice between cosine similarity, dot product, and Euclidean distance profoundly affects results in ways that aren’t immediately obvious. A recommendation system using …

How to Calculate Maximum Likelihood

December 30, 2025 by Peter Song

Maximum Likelihood Estimation (MLE) stands as one of the most fundamental techniques in statistics and machine learning for estimating parameters of probabilistic models. Whether you’re fitting a simple normal distribution to data, training a logistic regression classifier, or building complex neural networks, you’re likely using maximum likelihood principles, often without explicitly realizing it. The core …

Probabilistic Graphical Models: Deep Dive into Reasoning Under Uncertainty

December 20, 2025 by Peter Song

When you’re dealing with complex systems involving uncertainty—from medical diagnosis to computer vision to natural language processing—you need a framework that can represent intricate relationships between variables while handling probabilistic reasoning. Probabilistic graphical models provide exactly that: a powerful mathematical and visual language for encoding probability distributions over high-dimensional spaces. These models have revolutionized machine …

Fun Data Visualisation Ideas Using Free Datasets

December 6, 2025 by Peter Song

Data visualisation doesn’t have to be dry corporate dashboards and quarterly sales reports. Some of the most engaging, creative, and educational visualisations come from exploring quirky datasets about topics people actually care about—pop culture, sports, food, travel, and the countless fascinating patterns hidden in everyday life. The internet is overflowing with free, high-quality datasets just …

Data Engineers vs Data Scientists Explained

November 27, 2025 by Peter Song

The data revolution has created two critical roles that often confuse people outside the field—and sometimes even those within it. Data engineers and data scientists both work with data, both require technical skills, and both are essential for modern data-driven organizations. Yet these roles are fundamentally different in their focus, responsibilities, and the value they …

What is Google Dataset Search?

November 12, 2025 by Peter Song

In an era where data drives innovation across every field—from medical research to climate science to machine learning—finding the right datasets remains surprisingly difficult. Researchers often spend weeks searching through institutional repositories, government databases, and university websites, piecing together information scattered across thousands of sources. Google Dataset Search emerged to solve this fundamental problem: making …

Security Best Practices for Cloud-Based Data Science Notebooks

November 12, 2025 by Peter Song

Cloud-based data science notebooks have revolutionized how data scientists collaborate, experiment, and deploy models. Platforms like JupyterHub, Google Colab, AWS SageMaker, and Azure ML Studio offer unprecedented flexibility and computational power. However, this convenience comes with significant security challenges that organizations cannot afford to ignore. A single misconfigured notebook can expose sensitive datasets, leak API …