How to Use Sklearn for Regression Analysis

Scikit-learn (sklearn) is one of the most popular machine learning libraries in Python. It provides simple and efficient tools for data mining and data analysis. In this blog post, we will delve into how to use sklearn for regression analysis, a key method for predicting continuous outcomes.

What is Regression Analysis?

Regression analysis is a statistical technique used to determine the relationship between a dependent variable and one or more independent variables. It’s widely used in various fields such as finance, economics, and biological sciences. In machine learning, regression is used for tasks like predicting prices, sales forecasting, and risk assessment.

Regression Types

Regression analysis can be categorized into several types based on the nature of the relationship between the dependent and independent variables. Some common types include:

Linear Regression

Linear regression models the relationship between the dependent variable and one or more independent variables as a straight line (or, with several independent variables, a hyperplane). It is the simplest form of regression and is useful for predicting continuous outcomes.
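To make the idea concrete, here is a minimal sketch that fits a line to made-up, noise-free data following y = 2x + 1. The data and variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data following y = 2x + 1
X = np.arange(10, dtype=float).reshape(-1, 1)  # sklearn expects a 2-D feature array
y = 2.0 * X.ravel() + 1.0

line = LinearRegression().fit(X, y)
slope = line.coef_[0]        # estimated slope of the fitted line
intercept = line.intercept_  # estimated value of y at x = 0
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

Because the data contain no noise, the fitted slope and intercept should recover the values used to generate them.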

Polynomial Regression

Polynomial regression models the relationship between the dependent and independent variables as an nth-degree polynomial. This type is useful when the data shows a curvilinear relationship.
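In sklearn, one common way to do this is to expand the features with PolynomialFeatures and then fit an ordinary linear model on the expanded columns. The sketch below assumes made-up data following y = x^2 - 3x + 2.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noise-free data following y = x^2 - 3x + 2
X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = X.ravel() ** 2 - 3.0 * X.ravel() + 2.0

# Expand x into the columns [x, x^2], then fit a linear model on them
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_model.fit(X, y)

# make_pipeline names each step after its lowercased class name
poly_coefs = poly_model.named_steps["linearregression"].coef_
poly_intercept = poly_model.named_steps["linearregression"].intercept_
print(poly_coefs, poly_intercept)
```

The model is still linear in its coefficients; only the inputs have been transformed.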

Ridge Regression

Ridge regression is a type of linear regression that includes a regularization term to penalize large coefficients. This helps prevent overfitting, especially when dealing with multicollinear data.
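The shrinkage effect is easiest to see on two nearly identical features. The following sketch uses synthetic data (an assumption for illustration) and compares the size of the coefficient vector with and without the penalty.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)  # nearly a copy of x1 (multicollinear)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)
ridge_demo = Ridge(alpha=1.0).fit(X, y)

# The L2 penalty pulls the coefficient vector toward zero
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge_demo.coef_)
print(f"OLS coefficient norm:   {ols_norm:.3f}")
print(f"Ridge coefficient norm: {ridge_norm:.3f}")
```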

Lasso Regression

Lasso regression (Least Absolute Shrinkage and Selection Operator) is another regularization technique that can shrink some coefficients to zero, effectively performing variable selection and improving model interpretability.
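As a rough illustration of that selection effect, the sketch below builds synthetic data in which only the first two of five features matter. The data and the alpha value are assumptions picked to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features influence the target
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso_demo = Lasso(alpha=0.5).fit(X, y)

# Count coefficients that the L1 penalty set exactly to zero
n_zero = int(np.sum(lasso_demo.coef_ == 0))
print(lasso_demo.coef_)
print(f"coefficients set to zero: {n_zero}")
```

The coefficients of the two informative features are shrunk but keep their signs, while the uninformative ones are the candidates for being zeroed out.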

Logistic Regression

Logistic regression is used when the dependent variable is binary. Despite its name, it is a classification method: it models the probability that a given input point belongs to a particular category.
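A small sketch shows what "modeling a probability" looks like in practice. The hours-studied data below are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied (feature) and pass/fail (binary target)
hours = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [5.0]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(hours, passed)

# Each row holds [P(class 0), P(class 1)] for one input
proba = clf.predict_proba(np.array([[1.0], [4.5]]))
print(proba)
```

Each row of the output sums to one, and the probability of passing rises with the number of hours.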

Introduction to Sklearn

Sklearn is a robust library that includes various algorithms for classification, regression, clustering, and more. It is built on NumPy, SciPy, and matplotlib, making it an excellent choice for performing machine learning tasks in Python.

Setting Up the Environment

Before we dive into the code, we need to set up our environment. Ensure you have Python installed and then install sklearn using pip:

pip install scikit-learn

Loading Data

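For regression analysis, we need a dataset. Sklearn provides several sample datasets for practice. For this tutorial, we will use the Boston housing dataset. Note that load_boston was deprecated in scikit-learn 1.0 and removed in version 1.2, so the code below only runs on older releases.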

from sklearn.datasets import load_boston
data = load_boston()
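If load_boston is not available in your installed version, the same loading pattern can be tried on another bundled dataset. The sketch below swaps in load_diabetes, which is an assumption of this example and not the dataset used in the rest of the post.

```python
from sklearn.datasets import load_diabetes

# load_diabetes ships with scikit-learn and needs no download
diabetes = load_diabetes()

print(diabetes.data.shape)     # feature matrix: (n_samples, n_features)
print(diabetes.target.shape)   # continuous target, one value per sample
print(diabetes.feature_names)  # list of column names
```

The returned object exposes the same data, target, and feature_names attributes used in the steps that follow.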

Exploring the Dataset

Understanding the dataset is crucial before applying any machine learning algorithm. Let’s take a quick look at the Boston housing dataset.

import pandas as pd
df = pd.DataFrame(data.data, columns=data.feature_names)
df['PRICE'] = data.target
print(df.head())

Data Preprocessing

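Data preprocessing is an essential step in any machine learning pipeline. This includes handling missing values, encoding categorical variables, and feature scaling. For the Boston housing dataset, let’s proceed with feature scaling. For simplicity, the scaler here is fit on the full dataset; strictly speaking, it should be fit on the training split only so that no test-set statistics leak into training.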

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.drop('PRICE', axis=1))
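One way to keep the scaler from ever seeing the test split is to wrap it in a Pipeline together with the model. This is a sketch on synthetic stand-in data (an assumption, not the housing dataset), and the variable names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# fit() learns the scaling statistics from the training split only;
# score() reuses those statistics to transform the test split
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

test_r2 = pipe.score(X_te, y_te)  # R-squared on the held-out split
print(f"test R-squared: {test_r2:.3f}")
```

A pipeline like this can also be passed directly to cross-validation utilities, which then refit the scaler inside every fold.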

Splitting the Data

We need to split the dataset into training and testing sets. This helps in evaluating the model’s performance.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, df['PRICE'], test_size=0.2, random_state=42)

Building the Regression Model

Now, we will build a regression model using Linear Regression from sklearn.

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

Evaluating the Model

Evaluation is critical to understand how well our model performs. Common metrics for regression models include Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared.

from sklearn.metrics import mean_squared_error, r2_score

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
r2 = r2_score(y_test, y_pred)

print(f'MSE: {mse}')
print(f'RMSE: {rmse}')
print(f'R-squared: {r2}')
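To see what these metrics actually compute, the sketch below derives them by hand with NumPy on a tiny made-up example and checks them against sklearn. MSE is the mean of the squared errors, RMSE is its square root, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Small hypothetical example
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# Mean of squared errors and its square root
mse_manual = np.mean((y_true - y_hat) ** 2)
rmse_manual = np.sqrt(mse_manual)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

RMSE is often the easiest of the three to interpret because it is expressed in the same units as the target.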

Feature Importance

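Understanding the importance of each feature can provide insights into the model. In linear regression, the magnitude of each coefficient gives a rough indication of a feature’s importance, but the coefficients are only comparable because the features were standardized earlier. With correlated features, individual coefficients should be interpreted with caution.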

importance = model.coef_
# Pair each feature name with its fitted coefficient
for name, coef in zip(data.feature_names, importance):
    print(f'Feature: {name}, Score: {coef}')
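The sketch below illustrates why scale matters when reading coefficients. It uses synthetic features with hypothetical names "a", "b", and "c", where "b" is measured on a much larger scale than the others, and ranks the features by absolute coefficient after standardization.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
names = ["a", "b", "c"]  # hypothetical feature names
X = np.column_stack([
    rng.normal(scale=1.0, size=300),    # a
    rng.normal(scale=100.0, size=300),  # b: much larger scale
    rng.normal(scale=1.0, size=300),    # c
])
# On the raw scale, "b" has the smallest coefficient (0.2)
y = 5.0 * X[:, 0] + 0.2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=300)

# After standardization each coefficient is the effect of a one-standard-deviation change
X_std = StandardScaler().fit_transform(X)
reg = LinearRegression().fit(X_std, y)

ranked = sorted(zip(names, reg.coef_), key=lambda pair: abs(pair[1]), reverse=True)
for name, coef in ranked:
    print(f"{name}: {coef:.2f}")
```

Although "b" has the smallest raw coefficient, it moves the target the most per standard deviation, so it ranks first once the features share a common scale.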

Visualizing the Results

Visualization helps in better understanding the model’s performance. Let’s plot the actual vs predicted values.

import matplotlib.pyplot as plt

plt.scatter(y_test, y_pred)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Actual vs Predicted Prices')
plt.show()

Hyperparameter Tuning

Hyperparameter tuning can improve a model’s performance, although plain linear regression exposes only a few options to tune. We can use GridSearchCV to search over them with cross-validation.

from sklearn.model_selection import GridSearchCV

# Search over whether to fit an intercept and whether to force non-negative coefficients
parameters = {'fit_intercept': [True, False], 'positive': [True, False]}
grid = GridSearchCV(estimator=model, param_grid=parameters, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
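Grid search tends to matter more for regularized models, where the penalty strength has to be chosen. As a sketch, the example below tunes the alpha of a Ridge model on synthetic data. The grid values and the data are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data for illustration
X, y = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)

# Candidate penalty strengths on a logarithmic scale
alpha_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# Scores are negated MSE, so values closer to zero are better
search = GridSearchCV(Ridge(), param_grid=alpha_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
print(f"best alpha: {best_alpha}")
print(f"best CV score (negative MSE): {search.best_score_:.2f}")
```

After fitting, search.best_estimator_ holds a Ridge model refit on all the supplied data with the selected alpha.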

Advanced Regression Techniques

Sklearn provides several advanced regression techniques such as Ridge Regression, Lasso Regression, and ElasticNet. These techniques help in handling multicollinearity and feature selection.

Ridge Regression

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

Lasso Regression

from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

ElasticNet

from sklearn.linear_model import ElasticNet

elasticnet = ElasticNet(alpha=0.1, l1_ratio=0.7)
elasticnet.fit(X_train, y_train)

Comparing Regression Models

It’s essential to compare different regression models to select the best one for your specific problem.

models = {'Linear Regression': model, 'Ridge Regression': ridge, 'Lasso Regression': lasso, 'ElasticNet': elasticnet}

# Use a separate loop variable so the fitted `model` is not overwritten
for name, reg in models.items():
    y_pred = reg.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    rmse = mse ** 0.5
    r2 = r2_score(y_test, y_pred)
    print(f'{name} - MSE: {mse}, RMSE: {rmse}, R-squared: {r2}')
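A single train/test split can favor one model by chance, so a comparison is often more stable when averaged over several folds. The sketch below does this with cross_val_score on synthetic data. The data and the alpha values are assumptions that mirror the ones used above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data for illustration
X, y = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)

candidates = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=0.1),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.7),
}

cv_r2 = {}
for name, est in candidates.items():
    # cross_val_score clones and refits the estimator on each of the 5 folds
    scores = cross_val_score(est, X, y, cv=5, scoring="r2")
    cv_r2[name] = scores.mean()
    print(f"{name}: mean R-squared = {cv_r2[name]:.3f}")
```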

Conclusion

Regression analysis is a powerful tool for predicting continuous outcomes. Sklearn provides a comprehensive suite of tools for building and evaluating regression models. By following the steps outlined in this blog post, you can effectively use sklearn for regression analysis and apply it to various real-world problems.
