Heart Disease Prediction Using SVM

Predicting heart disease accurately is a critical challenge in healthcare. With the advancement of machine learning algorithms, Support Vector Machines (SVM) have shown promising results in predicting heart disease. This article provides a comprehensive guide on using SVM for heart disease prediction, including data preprocessing, model training, and evaluation.

Introduction

Heart disease is one of the leading causes of death globally. Early detection and diagnosis are crucial for effective treatment and prevention. Machine learning algorithms, especially Support Vector Machines (SVM), offer significant potential in developing predictive models that can assist healthcare professionals in identifying high-risk individuals based on various health metrics and patient history.

Understanding Support Vector Machines (SVM)

SVM is a supervised machine learning algorithm used for classification and regression tasks. For classification, it finds the hyperplane that separates the classes with the widest possible margin. SVM is particularly effective for high-dimensional data, and kernel functions let it model non-linear class boundaries. A soft margin, controlled by the regularization parameter C, lets it tolerate some noisy or misclassified points.

Key Features of SVM

  • Margin Maximization: SVM aims to maximize the margin between the data points of different classes.
  • Kernel Trick: Allows SVM to operate in a high-dimensional space by applying kernel functions, enabling it to handle non-linear relationships.
  • Regularization: Helps prevent overfitting by balancing the margin size and classification error.
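
The features listed above can be seen in a small experiment. The following is a minimal sketch, assuming scikit-learn is installed and using a synthetic two-ring dataset from make_circles rather than real heart disease data. The dataset parameters and variable names are illustrative choices, not part of the original guide.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic, illustrative data: two concentric rings that no straight line can separate
X, y = make_circles(n_samples=200, noise=0.05, factor=0.3, random_state=0)

# C is the regularization parameter: smaller values favor a wider margin,
# larger values penalize misclassified training points more heavily
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0).fit(X, y)

# Training accuracy of each kernel on the same data
linear_acc = linear_svm.score(X, y)
rbf_acc = rbf_svm.score(X, y)

print("Linear kernel training accuracy:", linear_acc)
print("RBF kernel training accuracy:", rbf_acc)
```

On data like this, the RBF kernel should separate the rings far better than the linear kernel, which is the practical effect of the kernel trick.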

Heart Disease Dataset Examples

Heart disease prediction datasets typically include various features that represent different health metrics and patient information. Here are some examples to illustrate how such a dataset might look:

Example 1: Basic Health Metrics Data

This dataset includes basic health metrics such as age, sex, cholesterol levels, and presence of heart disease.

Age | Sex | Cholesterol | Resting_BP | Max_Heart_Rate | Has_Heart_Disease
52 | M | 220 | 140 | 172 | 1
47 | F | 180 | 130 | 168 | 0
54 | M | 240 | 150 | 160 | 1
39 | F | 190 | 120 | 165 | 0
59 | M | 280 | 160 | 158 | 1

Example 2: Extended Data with More Features

This dataset includes additional features such as fasting blood sugar, electrocardiographic results, and exercise-induced angina.

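Age | Sex | Cholesterol | Resting_BP | Max_Heart_Rate | Fasting_Blood_Sugar | ECG_Result | Exercise_Angina | Has_Heart_Disease
52 | M | 220 | 140 | 172 | 1 | 0 | 1 | 1
47 | F | 180 | 130 | 168 | 0 | 1 | 0 | 0
54 | M | 240 | 150 | 160 | 1 | 1 | 1 | 1
39 | F | 190 | 120 | 165 | 0 | 0 | 0 | 0
59 | M | 280 | 160 | 158 | 1 | 0 | 1 | 1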

Example 3: Dataset Including Categorical Variables

This dataset includes categorical variables such as chest pain type and the slope of the peak exercise ST segment.

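Age | Sex | Chest_Pain_Type | Resting_BP | Cholesterol | Fasting_Blood_Sugar | Resting_ECG | Max_Heart_Rate | Exercise_Angina | ST_Slope | Has_Heart_Disease
52 | M | Typical_Angina | 140 | 220 | 1 | Normal | 172 | Yes | Up | 1
47 | F | Asymptomatic | 130 | 180 | 0 | ST | 168 | No | Flat | 0
54 | M | Non_Anginal | 150 | 240 | 1 | LVH | 160 | Yes | Down | 1
39 | F | Atypical_Angina | 120 | 190 | 0 | Normal | 165 | No | Up | 0
59 | M | Asymptomatic | 160 | 280 | 1 | ST | 158 | Yes | Flat | 1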

Example 4: Comprehensive Dataset with Numeric and Categorical Data

This dataset includes a mix of numerical and categorical features, representing a more detailed health profile.

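Age | Sex | Chest_Pain_Type | Resting_BP | Cholesterol | Fasting_Blood_Sugar | Resting_ECG | Max_Heart_Rate | Exercise_Angina | Oldpeak | ST_Slope | Ca | Thal | Has_Heart_Disease
52 | M | Typical_Angina | 140 | 220 | 1 | Normal | 172 | Yes | 1.2 | Up | 0 | Normal | 1
47 | F | Asymptomatic | 130 | 180 | 0 | ST | 168 | No | 0.6 | Flat | 1 | Fixed | 0
54 | M | Non_Anginal | 150 | 240 | 1 | LVH | 160 | Yes | 2.3 | Down | 2 | Reversible | 1
39 | F | Atypical_Angina | 120 | 190 | 0 | Normal | 165 | No | 0.0 | Up | 0 | Normal | 0
59 | M | Asymptomatic | 160 | 280 | 1 | ST | 158 | Yes | 1.5 | Flat | 3 | Fixed | 1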

These examples show typical columns found in heart disease prediction datasets, which can include a combination of numeric and categorical features representing patient demographics, health metrics, and diagnostic test results. These features are crucial for building predictive models using machine learning algorithms like SVM.
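
Because SVM works on numeric inputs, categorical columns like those in Example 3 need to be encoded before training. The following is a minimal sketch of one way to do that with pandas, using a few columns from Example 3. The 0/1 mappings and the choice of one-hot encoding are assumptions for illustration, not a prescribed scheme.

```python
import pandas as pd

# Illustrative rows copied from Example 3 above (subset of columns)
df = pd.DataFrame({
    "Age": [52, 47, 54, 39, 59],
    "Sex": ["M", "F", "M", "F", "M"],
    "Chest_Pain_Type": ["Typical_Angina", "Asymptomatic", "Non_Anginal",
                        "Atypical_Angina", "Asymptomatic"],
    "Resting_ECG": ["Normal", "ST", "LVH", "Normal", "ST"],
    "Exercise_Angina": ["Yes", "No", "Yes", "No", "Yes"],
    "Has_Heart_Disease": [1, 0, 1, 0, 1],
})

# Map two-valued columns to 0/1 (the mapping direction is an arbitrary choice)
df["Sex"] = df["Sex"].map({"F": 0, "M": 1})
df["Exercise_Angina"] = df["Exercise_Angina"].map({"No": 0, "Yes": 1})

# One-hot encode the multi-valued categorical columns
encoded = pd.get_dummies(df, columns=["Chest_Pain_Type", "Resting_ECG"], dtype=int)

print(encoded.columns.tolist())
```

One-hot encoding avoids implying an order between categories such as chest pain types, at the cost of adding one column per category.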

Data Preprocessing

Effective data preprocessing is vital for building a robust SVM model. Here are the common steps involved:

Data Cleaning

  1. Handling Missing Values: Replace missing values with mean, median, or mode, or use advanced imputation techniques.
  2. Removing Duplicates: Identify and remove duplicate records to ensure data quality.
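
The two cleaning steps above can be sketched with pandas as follows. The small DataFrame is made up for illustration, and the choice of median for numeric columns and mode for categorical columns is one reasonable option among several.

```python
import numpy as np
import pandas as pd

# Illustrative records with one missing value and one duplicated row
df = pd.DataFrame({
    "Age": [52, 47, 54, 54, 39],
    "Sex": ["M", "F", "M", "M", "F"],
    "Cholesterol": [220.0, np.nan, 240.0, 240.0, 190.0],
})

# Drop exact duplicate rows first so they do not skew the imputed statistic
df = df.drop_duplicates().reset_index(drop=True)

# Fill numeric gaps with the column median, categorical gaps with the mode
df["Cholesterol"] = df["Cholesterol"].fillna(df["Cholesterol"].median())
df["Sex"] = df["Sex"].fillna(df["Sex"].mode()[0])

print(df)
```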

Feature Selection

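Feature selection can improve model accuracy by removing irrelevant or redundant inputs. Techniques like the χ² (Chi-square) statistical test can be used to select the most relevant features for heart disease prediction (Springer) (MDPI). Note that the χ² test expects non-negative feature values, so it should be applied before any scaling step that produces negative values.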

Normalization

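Feature scaling puts the features on a comparable scale, ensuring that no single feature dominates the learning process. The example below uses standardization, which rescales each feature to zero mean and unit variance rather than to a fixed range. This is especially important for algorithms like SVM that are sensitive to feature scales.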

from sklearn.preprocessing import StandardScaler

# `features` holds the predictor columns only; the target labels must not be scaled
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

Building the SVM Model

Splitting the Data

Divide the dataset into training and testing sets to evaluate the model’s performance effectively.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
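
When scaling is combined with a train/test split, it is generally safer to fit the scaler on the training rows only, so that no information from the test set leaks into preprocessing. The following is a minimal sketch of that order of operations. It assumes synthetic random data as a stand-in for the real features and target, and it adds stratify, which is an optional choice not used in the snippet above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels (illustrative only)
rng = np.random.default_rng(42)
features = rng.normal(loc=150.0, scale=25.0, size=(100, 3))
target = rng.integers(0, 2, size=100)

# stratify keeps the class ratio similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)

# Learn the mean and standard deviation from the training rows only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training statistics on the test rows
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))
```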

Training the Model

Train the SVM model using the training data.

from sklearn.svm import SVC

svm_model = SVC(kernel='linear', C=1)
svm_model.fit(X_train, y_train)

Model Evaluation

Evaluate the model using metrics like accuracy, precision, recall, and F1-score.

from sklearn.metrics import accuracy_score, classification_report

y_pred = svm_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
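
In a screening setting, accuracy alone can hide how many actual cases the model misses. The following is a minimal sketch of reading sensitivity and specificity from a confusion matrix. The label lists are hypothetical values chosen for illustration, not results from any model in this guide.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = heart disease present)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

sensitivity = tp / (tp + fn)  # share of actual cases that were detected
specificity = tn / (tn + fp)  # share of healthy patients correctly cleared

print("Sensitivity:", sensitivity)
print("Specificity:", specificity)
```

Sensitivity is the same quantity that classification_report lists as recall for the positive class.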

Hyperparameter Tuning

Hyperparameter tuning can significantly improve the performance of the SVM model. Techniques like Grid Search or Random Search are commonly used.

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid_search = GridSearchCV(SVC(), param_grid, refit=True, verbose=2)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
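
Random Search, mentioned above, samples parameter combinations instead of trying every point on a grid. The following is a minimal sketch using scikit-learn's RandomizedSearchCV. It assumes synthetic data from make_classification in place of the real training set, and the sampling ranges, n_iter, and the F1 scoring choice are illustrative assumptions.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the training data (illustrative only)
X_train, y_train = make_classification(n_samples=200, n_features=8, random_state=42)

# Sample C and gamma from log-uniform ranges instead of a fixed grid;
# gamma is ignored when the sampled kernel is linear
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-4, 1e0),
    "kernel": ["linear", "rbf"],
}

random_search = RandomizedSearchCV(
    SVC(), param_distributions, n_iter=10, cv=5, scoring="f1", random_state=42
)
random_search.fit(X_train, y_train)

print("Best Parameters:", random_search.best_params_)
```

Random search is often a practical choice when the parameter ranges span several orders of magnitude, since the number of fits is fixed by n_iter rather than by the grid size.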

Advanced Techniques

Cross-Validation

Cross-validation provides a more robust evaluation of model performance by splitting the data into multiple folds.

from sklearn.model_selection import cross_val_score

scores = cross_val_score(svm_model, features, target, cv=5)
print("Cross-validation scores:", scores)
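
If the features are scaled, the scaler should be refit inside each fold rather than once on the full dataset. The following is a minimal sketch that wraps the scaler and the SVM in a scikit-learn pipeline and uses stratified folds. The synthetic data and the shuffle settings are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the features and target (illustrative only)
features, target = make_classification(n_samples=300, n_features=13, random_state=42)

# The scaler is refit on the training portion of every fold
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1))

# Stratified folds keep the class ratio similar in each split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, features, target, cv=cv)

print("Cross-validation scores:", scores)
print("Mean accuracy:", scores.mean())
```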

Feature Engineering

Creating new features can enhance model accuracy, for example by combining related features or transforming existing ones based on domain knowledge. The snippet below uses placeholder column names.

df['New_Feature'] = df['Feature1'] + df['Feature2']
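
As a more concrete illustration, the following sketch derives a feature from the Age and Max_Heart_Rate columns shown in Example 1. The "220 minus age" rule of thumb and the new column names are assumptions chosen for illustration; whether such a feature helps would need to be checked on real data.

```python
import pandas as pd

# Illustrative rows using the Age and Max_Heart_Rate columns from Example 1
df = pd.DataFrame({
    "Age": [52, 47, 54, 39, 59],
    "Max_Heart_Rate": [172, 168, 160, 165, 158],
})

# Rule-of-thumb age-predicted maximum heart rate (an assumption for illustration)
df["Predicted_Max_HR"] = 220 - df["Age"]

# Share of the predicted maximum that the patient actually reached
df["HR_Ratio"] = df["Max_Heart_Rate"] / df["Predicted_Max_HR"]

print(df)
```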

Case Study: Heart Disease Prediction

Dataset

The dataset used for this case study is the Heart Disease dataset from the UCI Machine Learning Repository. Its commonly used Cleveland subset includes 303 instances and 14 attributes: 13 predictors plus the diagnosis target (GitHub) (SpringerLink).

Implementation

  1. Load the Data: Import the dataset and perform initial exploration.
  2. Preprocessing: Handle missing values, normalize features, and select relevant features.
  3. Model Training: Train the SVM model using the preprocessed data.
  4. Evaluation: Evaluate the model’s performance using appropriate metrics.

Detailed Steps

Loading and Exploring the Data

First, load the dataset and perform an initial exploration to understand its structure and contents.

import pandas as pd

# Load dataset
df = pd.read_csv('heart_disease_data.csv')

# Display first few rows
print(df.head())

# Summary statistics
print(df.describe())

# Information about dataset (info() prints directly and returns None)
df.info()
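
The snippet above assumes a prepared CSV with a header row. If you start from the raw processed Cleveland file in the UCI repository instead, it has no header, marks missing values with "?", and stores the diagnosis as a code from 0 to 4. The following is a minimal sketch of loading that layout. The rows are made up to mimic the format, and the short column names are the conventional ones rather than anything stored in the file.

```python
import io
import pandas as pd

# Conventional short names for the 14 attributes (the raw file has no header row)
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]

# Made-up rows in the raw file's layout, standing in for the real download
raw = io.StringIO(
    "52,1,1,140,220,1,0,172,1,1.2,1,0,3,1\n"
    "47,0,4,130,180,0,1,168,0,0.6,2,?,6,0\n"
    "54,1,3,150,240,1,2,160,1,2.3,3,2,7,3\n"
)

# Treat "?" as missing so the affected columns are parsed as numbers
df = pd.read_csv(raw, header=None, names=columns, na_values="?")

# Collapse the 0-4 diagnosis code into a binary target (0 = no disease)
df["target"] = (df["num"] > 0).astype(int)
df = df.drop(columns="num")

print(df.isnull().sum())
```

Naming the binary column target keeps it compatible with the later steps, which refer to df['target'].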

Handling Missing Values

Check for missing values and handle them appropriately.

# Check for missing values
print(df.isnull().sum())

# Impute missing values in numeric columns with the column mean
df = df.fillna(df.mean(numeric_only=True))

Feature Selection

Use feature selection techniques to choose the most relevant features.

from sklearn.feature_selection import SelectKBest, chi2

# Separate predictors from the label; chi2 requires numeric, non-negative features
X = df.drop('target', axis=1)
y = df['target']

# Select top 10 features
best_features = SelectKBest(score_func=chi2, k=10)
fit = best_features.fit(X, y)

# Get selected feature indices (positions within X, not within df)
indices = fit.get_support(indices=True)

# Filter the predictors to keep only selected features, then re-attach the target
df_selected = X.iloc[:, indices].copy()
df_selected['target'] = y

Normalizing Features

Normalize the features to ensure consistent scaling.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_features = scaler.fit_transform(df_selected.drop('target', axis=1))

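# Convert to DataFrame, keeping the original row index so the target aligns
feature_columns = df_selected.columns.drop('target')
df_scaled = pd.DataFrame(scaled_features, columns=feature_columns, index=df_selected.index)
df_scaled['target'] = df_selected['target']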

Model Training and Evaluation

Split the data into training and testing sets, train the SVM model, and evaluate its performance.

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Split data
X_train, X_test, y_train, y_test = train_test_split(df_scaled.drop('target', axis=1), df_scaled['target'], test_size=0.2, random_state=42)

# Train SVM model
svm_model = SVC(kernel='linear', C=1)
svm_model.fit(X_train, y_train)

# Predict and evaluate
y_pred = svm_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Conclusion

Support Vector Machines (SVM) provide a powerful tool for predicting heart disease. By following the steps outlined in this guide, you can build an effective SVM model for heart disease prediction. Remember that the success of your model depends on thorough data preprocessing, feature selection, and hyperparameter tuning (GitHub) (SpringerLink).

Through continuous experimentation and validation, you can enhance the model’s performance, ultimately contributing to better healthcare outcomes by enabling early detection and intervention for heart disease.
