Introduction
Heart disease remains one of the most prevalent health issues globally, and effective early diagnosis can be life-saving. Machine learning has emerged as a potent tool for identifying disease patterns and enhancing prediction accuracy. In this article, we will explore how we built a robust machine-learning model to predict heart disease using a dataset from Kaggle, leveraging Principal Component Analysis (PCA) and Optuna for hyperparameter tuning.
Project Overview
This project tackles heart disease prediction by optimizing feature selection and model training with techniques such as PCA and automated hyperparameter tuning. Classifiers including Support Vector Machine (SVM), XGBoost, and Random Forest are trained on patient data to learn disease patterns.
Dataset Preparation
The Heart Failure Prediction Dataset from Kaggle, containing 918 instances with 11 features (such as age and cholesterol) plus a binary target label, was used. After data cleaning, categorical features were label-encoded so the models could process them.
# Import necessary libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Load the dataset
data = pd.read_csv('heart_failure_data.csv')
# Encode categorical features
encoder = LabelEncoder()
for col in ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']:
    data[col] = encoder.fit_transform(data[col])
Exploratory Data Analysis (EDA)
Understanding correlations between variables helps in feature selection and can reduce multicollinearity. A correlation heatmap was generated to observe feature interactions.
import seaborn as sns
import matplotlib.pyplot as plt
# Plot correlation heatmap
plt.figure(figsize=(10,8))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.show()
Feature Engineering with PCA
PCA was applied to reduce the dataset's dimensionality while retaining 95% of variance, resulting in a new set of 10 principal components.
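Before committing to the transformed features, it helps to confirm how many components the 95% threshold actually keeps. The following is a minimal, self-contained sketch; the random matrix here is only a stand-in for the real scaled feature matrix (918 rows, 11 features):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix (918 rows, 11 features)
rng = np.random.default_rng(42)
X = rng.normal(size=(918, 11))
X_scaled = StandardScaler().fit_transform(X)

# Passing a float to n_components keeps the smallest number of
# components whose cumulative explained variance reaches that fraction
pca = PCA(n_components=0.95)
pca.fit(X_scaled)
print("Components kept:", pca.n_components_)
print("Cumulative variance:", pca.explained_variance_ratio_.cumsum()[-1])
```

On the real dataset this check should confirm the 10 components reported above; on uncorrelated synthetic data it will keep nearly all of them, which itself illustrates why PCA only pays off when features are correlated.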
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Scale the features
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data.drop(columns='target'))
# Apply PCA
pca = PCA(n_components=0.95) # Retain 95% variance
principal_components = pca.fit_transform(scaled_data)
Model Selection and Hyperparameter Tuning
Multiple classifiers were compared to find the best performing model — Logistic Regression, Naive Bayes, Random Forest, XGBoost, SVM, and K-Nearest Neighbors. Hyperparameter tuning was performed with Optuna, which automates the search process for the best parameter combinations.
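A baseline comparison of the candidate classifiers with default settings can be sketched as follows. This is illustrative only: `make_classification` stands in for the PCA-transformed features, and XGBoost is omitted since it is a third-party package:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the PCA-transformed features and labels
X, y = make_classification(n_samples=918, n_features=10, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "NaiveBayes": GaussianNB(),
    "RandomForest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
}
for name, model in models.items():
    # 5-fold cross-validated accuracy for each classifier
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```

A comparison like this narrows the field before spending tuning budget; Optuna is then pointed at the most promising candidates, as shown below for SVM.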
import optuna
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
# Define objective function for SVM
def objective(trial):
    # Sample C on a log scale (suggest_loguniform is deprecated in Optuna 3.x)
    C = trial.suggest_float('C', 1e-4, 1e2, log=True)
    model = SVC(C=C, kernel='linear')
    scores = cross_val_score(model, principal_components, data['target'], cv=5)
    return scores.mean()
# Run Optuna study
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
# Best parameters
print("Best parameters: ", study.best_params)
Training and Evaluation
The best parameters found by Optuna (via 5-fold cross-validation during the search) were then used to train a final model, evaluated on a held-out 20% test split.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
# Split data
X_train, X_test, y_train, y_test = train_test_split(principal_components, data['target'], test_size=0.2, random_state=42)
# Train and predict with best SVM model
best_svm = SVC(C=study.best_params['C'], kernel='linear')
best_svm.fit(X_train, y_train)
y_pred = best_svm.predict(X_test)
# Evaluation
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Accuracy: {accuracy}\nPrecision: {precision}\nRecall: {recall}\nF1 Score: {f1}")
Results showed that SVM achieved the highest accuracy at 86%, making it the top-performing model for heart disease prediction.
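Beyond aggregate scores, a confusion matrix shows where a model errs, which matters here because false negatives (missed disease) are especially costly in a medical setting. A minimal sketch with hypothetical labels standing in for y_test and y_pred:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels and predictions standing in for y_test / y_pred
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_hat = [0, 1, 0, 0, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes;
# the off-diagonal cells count false positives and false negatives
print(confusion_matrix(y_true, y_hat))
print(classification_report(y_true, y_hat))
```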
Conclusion and Future Directions
Through this project, we achieved high accuracy in predicting heart disease using machine learning. By incorporating PCA and Optuna, we refined our feature engineering and model tuning processes, optimizing prediction performance. Future enhancements could include deep learning models or larger, more diverse datasets to further improve model generalizability.