Getting Started with Your First Data Science Notebook

Taking your first steps into data science can feel overwhelming with countless tools, libraries, and concepts to master. However, data science notebooks provide an ideal starting point—they combine code execution, documentation, and visualization in a single, interactive environment that makes learning intuitive and experimentation frictionless. Whether you’re a programmer exploring data analysis for the first time, a domain expert adding technical skills, or a student beginning your data science journey, notebooks offer an accessible gateway into this exciting field. This comprehensive guide walks you through everything you need to know to create, understand, and leverage your first data science notebook, building a solid foundation for more advanced analytical work.

Understanding What a Data Science Notebook Is

Before diving into technical setup, it’s important to understand what makes notebooks different from traditional programming environments and why they’ve become the standard tool for data science work.

The Notebook Metaphor mirrors physical lab notebooks scientists use to document experiments. Each notebook contains a sequence of cells that can hold different types of content—code that executes, text formatted with markdown, mathematical equations, images, or visualizations. This structure encourages a narrative approach to analysis where you explain your thinking, show your code, and display results in a logical progression that others (including your future self) can follow and understand.

Unlike traditional scripts that execute from top to bottom in one go, notebooks let you run individual cells independently and in any order. This flexibility proves invaluable during exploratory data analysis when you’re not sure what questions to ask next. You can run a cell that loads data, examine the results, then go back and modify earlier cells based on what you learned—all without restarting your entire analysis.

Interactive Computing forms the core value proposition. When you execute a cell containing code, results appear immediately below it. If you create a chart, it displays right there in the notebook. If you print a dataframe, you see the formatted table inline. This immediate feedback loop dramatically accelerates learning and experimentation compared to the traditional cycle of editing a script, rerunning it, and scrolling through its output.

The combination of code, results, and explanation creates living documents that serve multiple purposes: they’re development environments during active work, documentation explaining your methodology, and presentation materials for sharing insights with stakeholders. This versatility explains why notebooks have become ubiquitous in data science—they address the entire workflow from initial exploration through final communication.

Choosing and Setting Up Your Notebook Environment

Several excellent notebook platforms exist, each with distinct advantages. Your choice depends on whether you prefer local installation with complete control or cloud-based platforms offering instant access without setup complexity.

Jupyter Notebook and JupyterLab represent the original and most widely-used notebook platforms. Jupyter Notebook provides a straightforward interface focused on individual notebook files, while JupyterLab offers a more comprehensive IDE-like experience with file browsers, terminals, and side-by-side notebook viewing. Both run on your local machine after installation, giving you complete control over your environment.

Installing Jupyter locally requires Python on your system. The simplest approach uses Anaconda, a Python distribution bundled with Jupyter and hundreds of data science packages. Download Anaconda from anaconda.com, run the installer, and you’ll have Jupyter ready within minutes. Alternatively, if you already have Python installed, run pip install jupyter from your command line to add Jupyter to your existing Python environment.

Google Colab provides an excellent cloud-based alternative requiring zero installation. Simply visit colab.research.google.com with a Google account and start creating notebooks immediately. Colab offers free access to GPU resources, automatic saving to Google Drive, and easy sharing with collaborators. The tradeoff is dependence on internet connectivity and Google’s infrastructure, but for beginners, the convenience of instant access often outweighs these considerations.

Kaggle Notebooks offer another cloud option particularly suited for beginners. Kaggle provides free notebook environments with pre-loaded datasets, tutorial notebooks, and a community of data scientists sharing their work. If you’re learning data science and want access to interesting datasets and examples, Kaggle’s integrated platform reduces friction significantly.

For this guide, we’ll assume you’re using Jupyter Notebook locally via Anaconda, though concepts translate directly to other platforms with minimal differences in interface details.

Quick Start Options Comparison

  • 💻 Jupyter (Local): Full control, offline access, requires installation. Best for: long-term projects
  • ☁️ Google Colab: Free GPU, instant access, cloud-based. Best for: quick experiments
  • 📊 Kaggle: Built-in datasets, community, learning resources. Best for: learning & sharing

Creating and Understanding Your First Notebook

With your environment ready, it’s time to create your first notebook and understand its fundamental components. Launch Jupyter by opening Anaconda Navigator and clicking the Jupyter Notebook launch button, or by typing jupyter notebook in your command line. This opens your web browser showing Jupyter’s file interface.

Creating a New Notebook starts by clicking the “New” button in the upper right and selecting “Python 3” (or whichever kernel you prefer). This creates a blank notebook and opens it in a new tab. You’ll see a single empty cell ready for input, along with a toolbar containing execution controls and a menu bar with file operations.

Understanding Cell Types is crucial for effective notebook usage. Jupyter supports three primary cell types, switchable via the dropdown menu in the toolbar:

Code Cells contain executable Python code (or whatever language your kernel supports). When you run a code cell, the Python interpreter executes its contents and displays any output below the cell. Try this simple example—type print("Hello, Data Science!") into the first cell and press Shift+Enter to run it. You’ll see the output appear immediately below, and a new empty cell will be created underneath.

Markdown Cells contain formatted text, headers, lists, and documentation. Change your current cell to markdown type using the dropdown, then type # My First Notebook and run it. The text renders as a large header, creating a title for your notebook. Markdown cells let you explain your thinking, document methodology, and provide context that makes your analysis understandable to others.

Raw Cells contain unformatted text that isn’t executed or rendered, useful for specific export formats. For now, focus on code and markdown cells as these handle 99% of your needs.

Running Cells and Keyboard Shortcuts significantly impacts productivity. The most important shortcuts to learn immediately:

  • Shift+Enter: Run current cell and move to the next cell
  • Ctrl+Enter: Run current cell and stay on it
  • Alt+Enter: Run current cell and insert new cell below
  • A: Insert cell above (in command mode)
  • B: Insert cell below (in command mode)
  • D, D: Delete selected cell (in command mode)

Command mode versus edit mode represents an important distinction. When you click inside a cell to type, you’re in edit mode (indicated by a green border). Press Escape to enter command mode (blue border), where letter keys trigger shortcuts instead of typing characters. Press Enter to return to edit mode.

Loading and Exploring Your First Dataset

Real data science work begins when you start analyzing actual data. Let’s walk through loading a dataset and performing basic exploratory analysis, introducing essential libraries and techniques along the way.

Importing Essential Libraries should be your first step in any data science notebook. Add a new cell at the top and import the core libraries you’ll use:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

This imports pandas (for data manipulation), numpy (for numerical operations), and matplotlib (for visualization). The as keyword creates shortened aliases that are conventional in data science—everyone uses pd for pandas and np for numpy, making code immediately recognizable to other data scientists.

Loading Data with Pandas typically uses the read_csv() function for CSV files, though pandas supports dozens of formats including Excel, JSON, SQL databases, and web URLs. For your first exploration, let’s load a small, classic dataset directly from a public URL:

# Load the iris dataset from a URL
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
df = pd.read_csv(url)

This loads the classic iris flower dataset into a pandas DataFrame—the fundamental data structure for tabular data in Python. Run this cell and you’ve completed your first data loading operation.

Examining Dataset Structure comes next. Add several cells exploring the data:

# Display first few rows
df.head()

This shows the first five rows of your dataset, letting you see what columns exist and what the data looks like. You’ll see measurements for sepal and petal dimensions along with species classifications.

# Check dataset dimensions
df.shape

This returns a tuple like (150, 5) indicating 150 rows and 5 columns—quick confirmation of dataset size.

# Get column information and data types
df.info()

This summary lists column names, data types, and non-null counts (so you can spot missing values), which is critical for understanding data quality and structure.

# Calculate basic statistics
df.describe()

This generates summary statistics for numerical columns—mean, standard deviation, min, max, and quartiles. These statistics provide immediate insights into data distributions and help identify potential outliers or data issues.

Understanding DataFrames is essential for data science work. A DataFrame is essentially a table with rows and columns where each column can contain a different data type. You access columns using bracket notation: df['sepal_length'] returns a Series (a single column), while df[['sepal_length', 'petal_length']] returns a DataFrame (multiple columns).
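
To make the distinction concrete, here is a minimal sketch (using the same iris df) that checks the type each selection style returns:

# Single brackets return a Series (a single column)
lengths = df['sepal_length']
print(type(lengths))   # pandas.core.series.Series

# Double brackets return a DataFrame (one or more columns)
subset = df[['sepal_length', 'petal_length']]
print(type(subset))    # pandas.core.frame.DataFrame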

Practice basic operations in new cells:

# Calculate the mean of sepal length
df['sepal_length'].mean()

# Filter data for a specific species
setosa_data = df[df['species'] == 'setosa']
setosa_data.head()

# Calculate correlations between the numerical features
# (select the numeric columns first so the text 'species' column doesn't cause errors)
df.select_dtypes(include='number').corr()

Each operation demonstrates fundamental data manipulation techniques you’ll use constantly in data science work.
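
Grouping is one more operation worth trying right away. As a small sketch, still using the same iris df, this averages every measurement for each species:

# Average the sepal and petal measurements for each species
# (all remaining columns are numeric, so mean() applies cleanly)
df.groupby('species').mean()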

Creating Your First Visualizations

Visualizations transform numerical data into intuitive graphical representations that reveal patterns invisible in raw numbers. Let’s create several fundamental visualization types using matplotlib and seaborn.

Basic Line and Scatter Plots form the foundation of data visualization. Create a scatter plot showing the relationship between sepal length and width:

plt.figure(figsize=(10, 6))
plt.scatter(df['sepal_length'], df['sepal_width'], alpha=0.6)
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Sepal Dimensions Scatter Plot')
plt.grid(True, alpha=0.3)
plt.show()

Run this cell and you’ll see a scatter plot appear directly below the code. The alpha parameter controls point transparency, making overlapping points visible. The figsize parameter controls chart dimensions. The show() function renders the plot.

Histograms for Distribution Analysis reveal how values are distributed across ranges:

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.hist(df['petal_length'], bins=20, edgecolor='black', alpha=0.7)
plt.xlabel('Petal Length (cm)')
plt.ylabel('Frequency')
plt.title('Petal Length Distribution')

plt.subplot(1, 2, 2)
plt.hist(df['petal_width'], bins=20, edgecolor='black', alpha=0.7, color='orange')
plt.xlabel('Petal Width (cm)')
plt.ylabel('Frequency')
plt.title('Petal Width Distribution')

plt.tight_layout()
plt.show()

This creates side-by-side histograms comparing two distributions. The subplot() function divides the figure into a grid—(1, 2, 1) means 1 row, 2 columns, selecting position 1. The tight_layout() function prevents overlapping labels.

Enhanced Visualizations with Seaborn provides more sophisticated plotting with less code. First, import seaborn:

import seaborn as sns
sns.set_style("whitegrid")

Then create a more advanced visualization:

plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='sepal_length', y='sepal_width', 
                hue='species', style='species', s=100)
plt.title('Sepal Dimensions by Species')
plt.legend(title='Species', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

This scatter plot color-codes points by species, immediately revealing that different species cluster in different regions of the feature space—a key insight for classification tasks.
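
Seaborn can also summarize every pairwise relationship in a single figure. As an optional sketch, assuming the same df and the sns import above, a pair plot draws a scatter plot for each pair of measurements and a distribution plot for each individual one along the diagonal:

# Scatter plots for every pair of features, colored by species
sns.pairplot(df, hue='species')
plt.show()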

Essential First Notebook Operations

  • Data Loading: Use pd.read_csv() for CSV files, pd.read_excel() for Excel, or pd.read_sql() for databases
  • Data Inspection: Use .head(), .info(), .describe(), and .shape to understand your dataset
  • Data Filtering: Use boolean indexing like df[df['column'] > 5] to filter rows
  • Visualization: Start with plt.scatter(), plt.hist(), and plt.plot() for basic charts
  • Documentation: Use markdown cells to explain your thinking and document findings
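
These operations chain together into a short, repeatable workflow. The sketch below is a template rather than finished code: 'data.csv' and 'column_name' are placeholders to replace with your own file and column.

import pandas as pd
import matplotlib.pyplot as plt

# Load a CSV file (replace 'data.csv' with the path to your own data)
data = pd.read_csv('data.csv')

# Inspect structure, types, and summary statistics
print(data.shape)
data.info()
print(data.describe())

# Filter rows with boolean indexing (replace 'column_name' with a real column)
subset = data[data['column_name'] > 5]

# Quick look at the filtered column's distribution
plt.hist(subset['column_name'], bins=20)
plt.show()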

Building Your First Machine Learning Model

With data exploration complete, you’re ready to build a simple machine learning model—an exciting milestone in your data science journey. We’ll create a classification model that predicts iris species based on measurements.

Preparing Data for Machine Learning requires separating features (input variables) from the target (what we’re predicting):

# Separate features and target
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y = df['species']

# Split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

This splits your data into training data (80% of samples, used to teach the model) and testing data (20% of samples, used to evaluate performance on unseen data). The random_state parameter ensures reproducible splits—run the cell multiple times and you’ll get the same split each time.

Training a Classification Model uses scikit-learn’s intuitive API:

from sklearn.ensemble import RandomForestClassifier

# Create and train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Model trained successfully!")

You’ve now trained a random forest classifier—an ensemble machine learning algorithm that creates multiple decision trees and combines their predictions. The fit() method performs training using your training data.
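
If you’re curious which measurements the model leans on most, random forests expose learned feature importances. A short optional sketch, using the same model and X as above:

# Relative importance of each feature, as learned by the random forest
for name, importance in zip(X.columns, model.feature_importances_):
    print(f"{name}: {importance:.3f}")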

Evaluating Model Performance measures how well your model works:

from sklearn.metrics import accuracy_score, classification_report

# Make predictions on test data
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2%}")

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

You’ll likely see accuracy above 90%—excellent for a first model! The classification report shows precision, recall, and F1-score for each species, providing detailed performance insights.

Making Predictions on New Data demonstrates practical model application:

# Create a new sample with specific measurements
# (building a DataFrame with the training column names avoids a
#  feature-name warning from scikit-learn)
new_flower = pd.DataFrame([[5.1, 3.5, 1.4, 0.2]], columns=X.columns)

# Predict species
prediction = model.predict(new_flower)
print(f"Predicted species: {prediction[0]}")

# Get prediction probabilities
probabilities = model.predict_proba(new_flower)
print(f"\nPrediction probabilities:")
for species, prob in zip(model.classes_, probabilities[0]):
    print(f"{species}: {prob:.2%}")

This predicts the species for a new iris flower based on its measurements, along with confidence probabilities for each possible species.

Best Practices for Notebook Organization

As you create more complex notebooks, organization becomes crucial for maintainability and comprehension. Following these practices from the beginning establishes good habits.

Notebook Structure should follow a logical narrative flow:

  1. Title and Introduction: Use a markdown cell at the top with a clear title and brief description of the notebook’s purpose
  2. Imports and Setup: Group all library imports together in early cells
  3. Data Loading: Load datasets in a dedicated section
  4. Exploration: Perform exploratory data analysis systematically
  5. Processing: Apply data cleaning and transformation
  6. Analysis/Modeling: Conduct main analytical work
  7. Results: Present findings and visualizations
  8. Conclusions: Summarize insights and next steps

Documenting Your Work through markdown cells should explain:

  • Why you’re performing each analysis step
  • Interesting findings or unexpected results
  • Decisions made during data cleaning or feature engineering
  • Interpretations of visualizations and model results

Add markdown cells liberally—err on the side of over-documentation rather than under-documentation. Your future self reviewing the notebook months later will thank you.

Managing Cell Execution Order prevents subtle bugs. Notebooks allow running cells in any order, but this can create confusion when cells depend on previous results. Best practice: periodically restart your kernel and run all cells from top to bottom (Kernel → Restart & Run All) to ensure your notebook executes correctly in order.
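
As a small illustration of why this matters, imagine these two cells in a notebook (continuing with the iris df from earlier); the comments describe a hypothetical editing sequence rather than code to copy:

# Cell 1: define a threshold used later in the notebook
threshold = 5.0

# Cell 2: filter using that threshold
long_sepals = df[df['sepal_length'] > threshold]
print(len(long_sepals))

# If you later edit or delete Cell 1 but never rerun it, Cell 2 keeps using the
# old value of threshold stored in the kernel's memory. Restart & Run All
# exposes this kind of stale-state bug before it misleads your analysis.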

Keeping Cells Focused improves readability. Each cell should perform one logical task. If a cell becomes long and complex, consider breaking it into multiple cells. This also makes debugging easier—when errors occur, you can quickly identify which specific operation failed.

Saving, Sharing, and Next Steps

Saving Your Work happens automatically in Jupyter as you work, but explicitly save using Ctrl+S or File → Save periodically, especially before major changes. Notebooks are saved as .ipynb files containing your code, outputs, and markdown content.

Exporting Notebooks to other formats enables sharing with non-technical stakeholders:

  • HTML: File → Download as → HTML creates standalone web pages
  • PDF: Requires LaTeX installation but produces professional reports
  • Python Scripts: Extracts just the code for production use

Version Control becomes important as notebooks evolve. Consider using Git to track changes, though note that notebook files include outputs that can make diffs messy. Tools like nbdime provide better notebook-specific version control.

Continuing Your Learning should focus on deepening skills in areas most relevant to your goals:

  • Data Manipulation: Master pandas groupby operations, merging datasets, and time series handling
  • Visualization: Explore advanced seaborn plots, interactive Plotly charts, and dashboard libraries
  • Machine Learning: Study different algorithm types, feature engineering techniques, and model evaluation metrics
  • Domain Knowledge: Apply data science to areas you’re passionate about—finance, healthcare, sports, social science

Practice regularly by working on datasets that interest you. Kaggle provides thousands of datasets and tutorials. Find questions you genuinely want to answer, then figure out how to analyze data to answer them.

Conclusion

Creating your first data science notebook marks the beginning of an exciting journey into analytical thinking and computational problem-solving. You’ve learned how notebooks combine code, documentation, and visualization into cohesive analytical narratives, mastered fundamental operations for loading and exploring data, created meaningful visualizations, and even built your first machine learning model. These foundations support increasingly sophisticated analyses as you continue developing your skills.

The key to progressing from beginner to proficient data scientist lies in consistent practice and curiosity-driven exploration. Open a notebook whenever you encounter interesting data—whether public datasets, information from your work, or personal projects—and start asking questions. The interactive, experimental nature of notebooks makes them perfect learning environments where mistakes cost nothing and insights emerge through iterative refinement. Your first notebook won’t be your best work, but it represents the crucial first step in a journey that can transform how you understand and interact with the data-rich world around us.
