Step-by-Step Linear Regression in Jupyter Notebook

Linear regression is the foundation of predictive modeling and machine learning. Whether you’re predicting house prices, sales figures, or temperature trends, linear regression provides a powerful yet interpretable approach to understanding relationships between variables. This comprehensive guide will walk you through implementing linear regression in Jupyter Notebook from start to finish, covering everything from data preparation to model evaluation. By the end, you’ll have a complete working implementation you can adapt to your own projects.

Setting Up Your Jupyter Notebook Environment

Before diving into the code, you need to ensure your Jupyter Notebook environment has the necessary libraries installed. The primary tools we’ll use are NumPy for numerical operations, Pandas for data manipulation, Matplotlib and Seaborn for visualization, and scikit-learn for the machine learning implementation.

Open your terminal or command prompt and install the required packages if you haven’t already:

pip install jupyter numpy pandas matplotlib seaborn scikit-learn

Once installed, launch Jupyter Notebook by typing jupyter notebook in your terminal. This opens a browser window where you can create a new notebook. Create a new Python 3 notebook and you’re ready to begin.

Start your notebook by importing the essential libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Set display options for better visualization
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

The %matplotlib inline command ensures that plots appear directly in your notebook rather than in separate windows. This magic command is specific to Jupyter and makes interactive data exploration much smoother.

Loading and Exploring Your Dataset

For this tutorial, we’ll build a model that predicts house prices from a few numerical features. You can use any dataset with numerical variables, but we’ll create a synthetic sample dataset so everyone can follow along:

# Create a sample housing dataset
np.random.seed(42)
n_samples = 200

# Generate synthetic data
square_feet = np.random.uniform(1000, 3500, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
age = np.random.uniform(0, 50, n_samples)

# Create price with some relationship to features plus noise
price = (150 * square_feet + 20000 * bedrooms - 1000 * age + 
         np.random.normal(0, 50000, n_samples))

# Create DataFrame
df = pd.DataFrame({
    'square_feet': square_feet,
    'bedrooms': bedrooms,
    'age': age,
    'price': price
})

print(df.head())

If you’re working with your own CSV file, load it using Pandas:

df = pd.read_csv('your_dataset.csv')

Understanding your data is crucial before building any model. Execute these exploratory commands in separate cells:

# Display basic information
print(df.info())
print("\n" + "="*50 + "\n")

# Statistical summary
print(df.describe())

# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())

The info() method shows data types and non-null counts, while describe() provides statistical summaries including mean, standard deviation, and quartiles. These insights help you understand the scale and distribution of your variables.
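
The synthetic dataset here has no missing values, but real data often does. As a minimal sketch, two common starting points are dropping incomplete rows or filling numeric gaps with each column's median:

# Drop rows that contain any missing values (simple, but discards data)
df_clean = df.dropna()

# Alternatively, fill numeric gaps with each column's median
df_filled = df.fillna(df.median(numeric_only=True))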

Visualizing relationships between variables is essential in linear regression. Create a correlation heatmap and scatter plots:

# Correlation matrix
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()

# Scatter plots for each feature vs price
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
features = ['square_feet', 'bedrooms', 'age']

for idx, feature in enumerate(features):
    axes[idx].scatter(df[feature], df['price'], alpha=0.5)
    axes[idx].set_xlabel(feature)
    axes[idx].set_ylabel('price')
    axes[idx].set_title(f'{feature} vs Price')

plt.tight_layout()
plt.show()

The correlation matrix reveals which features have strong linear relationships with your target variable. Values close to 1 or -1 indicate strong positive or negative correlations, while values near 0 suggest weak linear relationships.
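
If you mainly care about how each feature relates to the target, you can pull that single column out of the correlation matrix and rank it, for example:

# Correlation of each feature with price, strongest first
price_corr = df.corr()['price'].drop('price')
print(price_corr.sort_values(ascending=False))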

Linear Regression Workflow

  1. Data Preparation: Load data, handle missing values, explore features
  2. Train-Test Split: Separate data into training and testing sets
  3. Model Training: Fit linear regression model on training data
  4. Prediction & Evaluation: Make predictions and assess model performance
  5. Analysis & Interpretation: Examine coefficients, residuals, and visualize results

Preparing Data for Model Training

Linear regression requires separating your data into features (independent variables) and target (dependent variable). Features are the inputs used to make predictions, while the target is what you’re trying to predict.

# Define features and target
X = df[['square_feet', 'bedrooms', 'age']]
y = df['price']

print("Feature matrix shape:", X.shape)
print("Target vector shape:", y.shape)

The feature matrix X contains all predictor variables, while y contains the target variable. The shape output confirms you have the correct number of samples and features.

Splitting your data into training and testing sets is fundamental to machine learning. The training set teaches the model patterns, while the test set evaluates how well those patterns generalize to unseen data:

# Split data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set size: {len(X_train)} samples")
print(f"Testing set size: {len(X_test)} samples")

The test_size=0.2 parameter reserves 20% of data for testing. The random_state=42 ensures reproducibility—you’ll get the same split every time you run the code. This is crucial for comparing different models or sharing results with colleagues.

Building and Training the Linear Regression Model

Now comes the exciting part: creating and training your linear regression model. Thanks to scikit-learn, this process is remarkably straightforward:

# Create linear regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

print("Model training complete!")
print(f"Model intercept: {model.intercept_:.2f}")
print("\nModel coefficients:")
for feature, coef in zip(X.columns, model.coef_):
    print(f"  {feature}: {coef:.2f}")

When you call fit(), scikit-learn calculates the optimal coefficients using the ordinary least squares method. The model finds the line (or hyperplane in multiple dimensions) that minimizes the sum of squared differences between predicted and actual values.
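
To see that fit() really is solving an ordinary least squares problem, here is a minimal cross-check that solves the same system directly with NumPy (adding a column of ones for the intercept) and compares the result to scikit-learn's coefficients:

# Solve OLS directly: prepend an intercept column and use least squares
X_design = np.column_stack([np.ones(len(X_train)), X_train])
coefs, _, _, _ = np.linalg.lstsq(X_design, y_train, rcond=None)

print("Intercept (NumPy):", coefs[0])
print("Coefficients (NumPy):", coefs[1:])
print("Coefficients (sklearn):", model.coef_)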

The intercept represents the predicted price when all features equal zero (though this may not have real-world meaning). The coefficients show how much the target variable changes for each unit increase in the corresponding feature, holding other features constant.

For example, if the square_feet coefficient is 150, it means each additional square foot increases the predicted price by $150, assuming bedrooms and age remain constant. Understanding these coefficients is key to interpreting your model’s behavior.
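
You can verify this interpretation by reproducing a single prediction by hand from the intercept and coefficients; the result should match model.predict exactly (the numbers themselves depend on the synthetic data above):

# Recompute the first test prediction manually: intercept + sum(coef * feature)
first_row = X_test.iloc[0]
manual_pred = model.intercept_ + np.dot(model.coef_, first_row.values)

print(f"Manual prediction: {manual_pred:.2f}")
print(f"model.predict():   {model.predict(X_test.iloc[[0]])[0]:.2f}")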

Making Predictions and Evaluating Performance

With a trained model, you can now make predictions on the test set:

# Make predictions on test set
y_pred = model.predict(X_test)

# Display first 10 predictions vs actual values
comparison_df = pd.DataFrame({
    'Actual': y_test.iloc[:10].values,
    'Predicted': y_pred[:10],
    'Difference': y_test.iloc[:10].values - y_pred[:10]
})
print(comparison_df)

This comparison table shows how close your predictions are to actual values. Large differences suggest the model struggles with those particular samples.

Evaluation metrics quantify model performance. The three most important metrics for regression are:

# Calculate evaluation metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("\nModel Performance Metrics:")
print(f"Mean Squared Error (MSE): {mse:,.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:,.2f}")
print(f"Mean Absolute Error (MAE): {mae:,.2f}")
print(f"R-squared (R²): {r2:.4f}")

Mean Squared Error (MSE) calculates the average squared difference between predictions and actual values. Squaring penalizes large errors more heavily than small ones. However, MSE is in squared units, making interpretation difficult.

Root Mean Squared Error (RMSE) takes the square root of MSE, returning the metric to the original units. If predicting house prices in dollars, RMSE tells you the typical prediction error in dollars. An RMSE of $50,000 means predictions are typically off by $50,000.

Mean Absolute Error (MAE) calculates the average absolute difference between predictions and actual values. Unlike RMSE, it treats all errors equally regardless of magnitude. MAE is often more intuitive and less sensitive to outliers.

R-squared (R²) measures the proportion of variance in the target variable explained by your features. Values typically range from 0 to 1, with 1 indicating perfect predictions (on held-out data, scikit-learn's r2_score can even be negative when a model does worse than simply predicting the mean). An R² of 0.85 means your model explains 85% of the variance in house prices. Generally:

  • R² above 0.7 suggests a strong model
  • R² between 0.4 and 0.7 indicates moderate predictive power
  • R² below 0.4 suggests weak predictive power (though context matters)
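
To make these formulas concrete, here is a short check that recomputes each metric directly from the prediction errors with NumPy; the values should match the scikit-learn outputs above:

# Recompute the metrics from first principles
errors = y_test.values - y_pred

mse_manual = np.mean(errors ** 2)        # average squared error
rmse_manual = np.sqrt(mse_manual)        # back to the original units (dollars)
mae_manual = np.mean(np.abs(errors))     # average absolute error
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_test - y_test.mean()) ** 2)

print(mse_manual, rmse_manual, mae_manual, r2_manual)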

Visualizing Model Performance

Visualization brings your model’s performance to life. Create a predicted vs actual plot:

# Predicted vs Actual plot
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.6, edgecolors='k', linewidth=0.5)
plt.plot([y_test.min(), y_test.max()], 
         [y_test.min(), y_test.max()], 
         'r--', lw=2, label='Perfect Prediction')
plt.xlabel('Actual Price', fontsize=12)
plt.ylabel('Predicted Price', fontsize=12)
plt.title('Predicted vs Actual House Prices', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In this plot, points falling exactly on the red diagonal line represent perfect predictions. Points above the line indicate overestimations, while points below show underestimations. Clustering around the line suggests good model performance, while scattered points indicate poor predictions.

Residual plots help diagnose model problems:

# Residual plot
residuals = y_test - y_pred

plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals, alpha=0.6, edgecolors='k', linewidth=0.5)
plt.axhline(y=0, color='r', linestyle='--', lw=2)
plt.xlabel('Predicted Price', fontsize=12)
plt.ylabel('Residuals', fontsize=12)
plt.title('Residual Plot', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Residual distribution
plt.figure(figsize=(10, 6))
plt.hist(residuals, bins=30, edgecolor='black', alpha=0.7)
plt.xlabel('Residuals', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Residuals', fontsize=14)
plt.axvline(x=0, color='r', linestyle='--', lw=2)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

A good residual plot shows randomly scattered points around zero with no clear patterns. Patterns in residuals suggest:

  • Funnel shape: Heteroscedasticity (variance increases with predicted values)
  • Curved pattern: Non-linear relationships that linear regression can’t capture
  • Clustered outliers: Influential points that disproportionately affect the model

The residual distribution histogram should approximate a normal distribution centered at zero. Significant skewness or multiple peaks suggest model issues.
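
For a more formal look at normality than the histogram, one common option is a Q-Q plot using SciPy (installed as a scikit-learn dependency); points that hug the diagonal suggest roughly normal residuals:

from scipy import stats

# Q-Q plot: compare residual quantiles against a normal distribution
plt.figure(figsize=(8, 6))
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Q-Q Plot of Residuals')
plt.tight_layout()
plt.show()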

Understanding Model Coefficients

Consider this example model equation:

Price = 50,000 + (150 × Square_Feet) + (20,000 × Bedrooms) - (1,000 × Age)

Positive coefficients increase the predicted price:

  • Square_Feet: +150. Each additional square foot increases price by $150.
  • Bedrooms: +20,000. Each additional bedroom increases price by $20,000.

Negative coefficients decrease it:

  • Age: -1,000. Each additional year of age decreases price by $1,000; older houses typically have lower values.

Interpretation tip: coefficients show the change in the target for a one-unit change in a feature, holding all other features constant. Larger absolute values indicate stronger influence on predictions.

Making Predictions on New Data

Once satisfied with your model, use it to predict outcomes for new data points:

# Create new data for prediction
new_data = pd.DataFrame({
    'square_feet': [2500, 1800, 3000],
    'bedrooms': [3, 2, 4],
    'age': [10, 25, 5]
})

# Make predictions
new_predictions = model.predict(new_data)

# Display results
new_data['predicted_price'] = new_predictions
print("\nPredictions for new houses:")
print(new_data)

This demonstrates the practical application of your model. Real-world usage might involve reading new data from a file, database, or user input, then generating predictions for business decisions.
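
For instance, if new listings arrived as a CSV with the same three feature columns, the workflow might look like the sketch below (the filename is just a placeholder):

# Hypothetical file containing square_feet, bedrooms, and age columns
incoming = pd.read_csv('new_houses.csv')

# Predict and attach the results
incoming['predicted_price'] = model.predict(incoming[['square_feet', 'bedrooms', 'age']])
print(incoming.head())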

Interpreting Coefficients and Feature Importance

Understanding what your model learned is as important as its predictive accuracy. Extract and analyze the coefficients:

# Create coefficient DataFrame for better visualization
coef_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_,
    'Abs_Coefficient': np.abs(model.coef_)
}).sort_values('Abs_Coefficient', ascending=False)

print("\nFeature Importance (based on coefficient magnitude):")
print(coef_df)

# Visualize coefficients
plt.figure(figsize=(10, 6))
plt.barh(coef_df['Feature'], coef_df['Coefficient'], color='steelblue', edgecolor='black')
plt.xlabel('Coefficient Value', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.title('Feature Coefficients', fontsize=14)
plt.axvline(x=0, color='red', linestyle='--', linewidth=1)
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

Features with larger absolute coefficient values have stronger influence on predictions. However, be cautious: coefficient magnitude depends on feature scale. A feature measured in thousands will naturally have a smaller coefficient than one measured in ones.

For fair comparison, you could standardize features before training (scaling them to have mean 0 and standard deviation 1). Coefficients from standardized models represent the change in target per one standard deviation change in the feature, making comparisons meaningful.
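
One minimal way to do this with scikit-learn is to refit on scaled features using StandardScaler; the sketch below reuses the same training split and prints coefficients that are directly comparable across features:

from sklearn.preprocessing import StandardScaler

# Standardize features, then refit to get scale-comparable coefficients
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

model_std = LinearRegression().fit(X_train_scaled, y_train)

for feature, coef in zip(X.columns, model_std.coef_):
    print(f"{feature}: {coef:,.2f}  (change in price per one standard deviation)")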

Common Issues and Troubleshooting

Low R² Score

If your R² is disappointingly low, consider these possibilities:

  • Missing important features: Your model may lack key predictors. Domain knowledge helps identify relevant variables.
  • Non-linear relationships: Linear regression assumes linear relationships. If relationships are curved or exponential, consider polynomial features or non-linear models.
  • High noise in data: Some phenomena are inherently unpredictable, limiting maximum achievable R².
  • Outliers: Extreme values can distort the model. Investigate and potentially remove or transform outliers.

High Training Performance, Low Test Performance

This indicates overfitting. Your model memorized training data rather than learning generalizable patterns. Solutions include:

  • Collecting more training data
  • Reducing feature count (feature selection)
  • Using regularization techniques like Ridge or Lasso regression (see the sketch below)
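
As a minimal sketch of that last option, using the same train/test split as above: Ridge and Lasso add a penalty on coefficient size, and alpha controls its strength (in practice you would tune it, for example with cross-validation):

from sklearn.linear_model import Ridge, Lasso

# Note: both typically work best on standardized features

# Ridge: L2 penalty shrinks coefficients toward zero
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
print("Ridge R² on test set:", ridge.score(X_test, y_test))

# Lasso: L1 penalty can drive weak coefficients all the way to zero
lasso = Lasso(alpha=1.0).fit(X_train, y_train)
print("Lasso R² on test set:", lasso.score(X_test, y_test))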

Patterns in Residual Plots

Non-random residual patterns suggest model assumptions are violated:

  • Try polynomial features to capture non-linearity (see the sketch after this list)
  • Check for and address heteroscedasticity
  • Consider robust regression techniques if outliers are present
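
A minimal sketch of the polynomial-features idea, again reusing the same split: expand the feature matrix with squared and interaction terms, then fit an ordinary LinearRegression on the expanded matrix:

from sklearn.preprocessing import PolynomialFeatures

# Add squared and interaction terms (degree 2) to the feature matrix
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

poly_model = LinearRegression().fit(X_train_poly, y_train)
print("Polynomial model R² on test set:", poly_model.score(X_test_poly, y_test))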

Saving and Loading Your Model

After training a model, save it for future use without retraining:

import joblib

# Save the model
joblib.dump(model, 'linear_regression_model.pkl')
print("Model saved successfully!")

# Load the model (in a new session)
loaded_model = joblib.load('linear_regression_model.pkl')

# Use loaded model for predictions
test_prediction = loaded_model.predict(new_data)
print("Loaded model predictions:", test_prediction)

This is essential for deploying models in production environments where retraining for every prediction would be inefficient.

Conclusion

You’ve now completed a full linear regression implementation in Jupyter Notebook, from data loading through model evaluation and interpretation. This step-by-step approach provides a solid foundation for predictive modeling projects. You learned to prepare data, split it appropriately, train models, evaluate performance using multiple metrics, visualize results, and interpret coefficients—all essential skills for any data scientist or machine learning practitioner.

Linear regression’s simplicity and interpretability make it an ideal starting point for understanding machine learning, but don’t underestimate its practical value. Many real-world problems benefit from linear models, and the skills you’ve developed here—data preprocessing, train-test splitting, performance evaluation, and visualization—transfer directly to more complex algorithms. Keep experimenting with different datasets, features, and evaluation techniques to deepen your understanding and build confidence in your modeling abilities.
