How to Use Jupyter Notebook for Big Data Exploration with PySpark

Big data has become the lifeblood of modern data-driven organizations, but working with massive datasets requires tools that can handle scale without sacrificing usability. Jupyter Notebook combined with PySpark offers a powerful solution, bringing the interactive, iterative nature of notebook-based development to the distributed computing capabilities of Apache Spark. This combination allows data scientists and engineers to explore and transform large datasets interactively.
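
To make the combination concrete, here is a minimal sketch of what the setup typically looks like in a notebook cell. It assumes PySpark is installed in the notebook's environment and runs Spark in local mode; the app name, master URL, and file path are illustrative, not taken from the article.

```python
# Minimal sketch: starting a SparkSession inside a Jupyter cell.
# Assumes pyspark is installed (e.g. pip install pyspark).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("notebook-exploration")   # hypothetical app name
    .master("local[*]")                # all local cores; swap for a cluster master URL
    .getOrCreate()
)

# Interactive exploration: read a file and inspect it without
# collecting the full dataset to the driver.
df = spark.read.csv("data/events.csv", header=True, inferSchema=True)  # hypothetical path
df.printSchema()
df.show(5)
```

Because Spark evaluates transformations lazily, cells like these stay cheap until an action such as `show()` runs, which is what makes the notebook loop practical on large data.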

Building a Scalable PySpark Data Pipeline: Step-by-Step Example

Building data pipelines that scale from gigabytes to terabytes requires fundamentally different approaches from traditional single-machine processing. PySpark provides the distributed computing framework necessary for handling enterprise-scale data, but structuring pipelines for scalability requires an understanding of both the framework’s capabilities and distributed computing principles. This guide walks through building a complete, production-ready PySpark data pipeline.
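
As a rough preview of the pipeline shape the guide describes, the sketch below separates the classic extract, transform, and load stages into functions so each can be tested and tuned independently. The paths, column names, and aggregation are assumptions for illustration only, not the guide's actual example.

```python
# Minimal sketch of a scalable pipeline shape: read -> transform -> write.
# All paths and column names below are illustrative assumptions.
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

def extract(spark: SparkSession, path: str) -> DataFrame:
    # Columnar formats like Parquet keep reads scan-efficient at scale.
    return spark.read.parquet(path)

def transform(df: DataFrame) -> DataFrame:
    # Filter early and aggregate with built-in functions so Spark can
    # execute the work on executors instead of shipping rows to the driver.
    return (
        df.filter(F.col("amount") > 0)
          .groupBy("customer_id")
          .agg(F.sum("amount").alias("total_amount"))
    )

def load(df: DataFrame, path: str) -> None:
    # Overwrite mode keeps reruns of the pipeline idempotent.
    df.write.mode("overwrite").parquet(path)

if __name__ == "__main__":
    spark = SparkSession.builder.appName("example-pipeline").getOrCreate()
    raw = extract(spark, "s3://bucket/raw/orders/")            # hypothetical input
    load(transform(raw), "s3://bucket/curated/orders_total/")  # hypothetical output
    spark.stop()
```

Keeping the transform stage as a pure DataFrame-to-DataFrame function is a common convention: it lets the same logic run unchanged against a small local sample in tests and the full dataset in production.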