Heart Attack Prediction with Logistic Regression

Introduction: Why Heart Attack Prediction Matters

Heart disease remains the leading cause of death globally. In 2026, wearable health tech and AI-powered diagnostics are transforming early detection. For your CSCI 183 homework 3, you'll build a logistic regression model to predict heart attack risk using the classic UCI Heart Disease dataset. This tutorial walks you through data visualization, feature selection, model implementation, and evaluation—just like the assignment requires.

Getting Started: Load and Understand the Data

First, import the necessary libraries and load the dataset. The dataset has 14 attributes (13 features + 1 target). The target is 0 (low risk) or 1 (high risk).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

df = pd.read_csv('heart.csv')
print(df.head())
print(df.info())

Check for missing values and basic statistics. The dataset is expected to be clean, but verify this yourself before modeling.
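
One way to run that check is sketched below. The quality_report helper and the small demo DataFrame are made up for illustration (the demo only stands in for heart.csv so the snippet runs on its own); with the real data you would pass df instead.

```python
import numpy as np
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, and number of unique values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })


# Hypothetical stand-in for heart.csv; in the assignment use
# df = pd.read_csv('heart.csv') and call quality_report(df).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.integers(29, 78, size=50).astype(float),
    "chol": rng.integers(126, 400, size=50).astype(float),
    "target": rng.integers(0, 2, size=50),
})
demo.loc[3, "chol"] = np.nan  # inject one missing value on purpose

report = quality_report(demo)
print(report)
print("Duplicate rows:", demo.duplicated().sum())
print(demo.describe())  # basic statistics for every numeric column
```

A nonzero n_missing tells you which columns need attention before you plot or fit anything.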

Step 1: Visualize Everything with Matplotlib

Your assignment asks you to create as many plots as possible. Use histograms, scatter plots, box plots, and correlation heatmaps. The goal is to spot features that separate the two classes linearly.

Histograms for Single Features

Plot age distribution by target:

plt.figure(figsize=(10,6))
plt.hist([df[df['target']==0]['age'], df[df['target']==1]['age']], bins=20, label=['Low Risk', 'High Risk'], alpha=0.7)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution by Heart Attack Risk')
plt.legend()
plt.show()

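Check whether one class dominates in any age range. Similarly, plot sex, chest pain type (cp), resting blood pressure (trestbps), cholesterol (chol), fasting blood sugar (fbs), restecg, thalach (max heart rate), exang (exercise induced angina), oldpeak, slope, ca (number of major vessels), and thal. Several of these (sex, cp, fbs, restecg, exang, slope, ca, thal) take only a few discrete values, so their histograms will look like bar charts.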
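
Rather than copying the histogram code once per feature, you could loop over the columns. The sketch below assumes a helper named plot_feature_histograms (not part of the assignment) and uses a small random DataFrame in place of heart.csv so it runs standalone.

```python
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_feature_histograms(df, features, target="target", ncols=3):
    """Draw one histogram per feature, split by class, and return (fig, axes)."""
    nrows = math.ceil(len(features) / ncols)
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows), squeeze=False)
    flat_axes = axes.ravel()
    for ax, col in zip(flat_axes, features):
        groups = [df.loc[df[target] == cls, col] for cls in (0, 1)]
        ax.hist(groups, bins=15, label=["Low Risk", "High Risk"], alpha=0.7)
        ax.set_title(col)
    # Hide any unused panels in the last row.
    for ax in flat_axes[len(features):]:
        ax.set_visible(False)
    flat_axes[0].legend()
    fig.tight_layout()
    return fig, flat_axes


# Hypothetical stand-in for heart.csv with random values.
rng = np.random.default_rng(0)
n = 60
demo = pd.DataFrame({
    "target": np.tile([0, 1], n // 2),
    "age": rng.integers(29, 78, size=n),
    "trestbps": rng.integers(94, 200, size=n),
    "chol": rng.integers(126, 400, size=n),
    "thalach": rng.integers(71, 202, size=n),
})
fig, hist_axes = plot_feature_histograms(demo, ["age", "trestbps", "chol", "thalach"])
```

Call plt.show() afterwards to display the grid, or fig.savefig(...) to keep a copy for your report.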

Scatter Plots for Feature Pairs

Scatter plots help see linear separability. For example, plot age vs. max heart rate colored by target:

plt.figure(figsize=(10,6))
scatter = plt.scatter(df['age'], df['thalach'], c=df['target'], cmap='coolwarm', alpha=0.6)
plt.xlabel('Age')
plt.ylabel('Max Heart Rate')
plt.title('Age vs Max Heart Rate')
plt.colorbar(scatter, label='Target')
plt.show()

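Look for clusters. If points of different colors separate well, that feature pair is promising. Try combinations like oldpeak vs. slope, or ca vs. thal. Because slope, ca, and thal take only a few discrete values, many points will land on the same coordinates; adding a small random jitter makes the overlap visible.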

Box Plots to Compare Distributions

Box plots show median, quartiles, and outliers for each class. For example:

plt.figure(figsize=(10,6))
sns.boxplot(x='target', y='oldpeak', data=df)
plt.title('Oldpeak by Target')
plt.show()

If the boxes (the interquartile ranges) of the two classes barely overlap, the feature is likely discriminative on its own.
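
If you want a number to go with the picture, you could compute the quartiles per class and measure how much the two boxes overlap. This is only a sketch: iqr_by_class and iqr_overlap are hypothetical helpers, and the tiny DataFrame uses made-up values so that the arithmetic is easy to follow.

```python
import pandas as pd


def iqr_by_class(df, feature, target="target"):
    """Return Q1, median, and Q3 of `feature` for each class."""
    q = df.groupby(target)[feature].quantile([0.25, 0.5, 0.75]).unstack()
    q.columns = ["q1", "median", "q3"]
    return q


def iqr_overlap(q):
    """Length of the overlap between the two classes' boxes (0 if disjoint)."""
    low = q["q1"].max()   # higher of the two lower box edges
    high = q["q3"].min()  # lower of the two upper box edges
    return max(0.0, float(high - low))


# Made-up values: oldpeak separates the classes, age does not.
demo = pd.DataFrame({
    "target":  [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "oldpeak": [0.0, 0.2, 0.4, 0.6, 0.8, 2.0, 2.5, 3.0, 3.5, 4.0],
    "age":     [40, 50, 60, 70, 80, 40, 50, 60, 70, 80],
})

oldpeak_q = iqr_by_class(demo, "oldpeak")
age_q = iqr_by_class(demo, "age")
print(oldpeak_q)
print("oldpeak box overlap:", iqr_overlap(oldpeak_q))
print("age box overlap:", iqr_overlap(age_q))
```

An overlap of zero means the boxes are disjoint. An overlap as wide as the boxes themselves means the feature alone tells you little.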

Correlation Heatmap

A heatmap shows the pairwise correlations among the features and the target. A high absolute correlation with the target suggests a feature is useful for a linear model.

plt.figure(figsize=(12,10))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Feature Correlation Heatmap')
plt.show()

Look for features whose absolute correlation with target exceeds 0.3. Treat this as a rule of thumb, not a hard cutoff. Common ones: cp, thalach, oldpeak, ca, thal.
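
Reading the target row off a large heatmap is error-prone, so you might also print the ranking directly. In the sketch below, rank_by_target_correlation is a hypothetical helper and the column names strong_pos, strong_neg, and noise are invented synthetic features, not columns of heart.csv.

```python
import numpy as np
import pandas as pd


def rank_by_target_correlation(df, target="target", threshold=0.3):
    """Rank features by |Pearson correlation| with the target.

    Returns the signed correlations (strongest first) and the
    names of the features whose absolute value exceeds `threshold`.
    """
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    ranked = corr.reindex(order)
    selected = ranked[ranked.abs() > threshold].index.tolist()
    return ranked, selected


# Synthetic stand-in data with known relationships to the target.
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "strong_pos": 2 * target + rng.normal(scale=0.5, size=n),
    "strong_neg": -target + rng.normal(scale=0.5, size=n),
    "noise": rng.normal(size=n),
    "target": target,
})

ranked, selected = rank_by_target_correlation(demo)
print(ranked)
print("Above threshold:", selected)
```

Note that the ranking uses the absolute value, so a strongly negative correlation counts just as much as a strongly positive one.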

Step 2: Select Features for Classification

Based on plots, choose features that help separate classes. Eliminate irrelevant ones like patient ID (if present) or features with overlapping distributions. For example, age may not separate well alone, but combined with other features it helps. From the heatmap, features like cp, thalach, oldpeak, ca, and thal often rank high. You might also try combinations: oldpeak and slope together can indicate ST segment changes.

Remember the assignment's advice: start with one attribute, then add others. Use visualization to eliminate attributes rather than to select them. For instance, fbs (fasting blood sugar) often has low correlation and overlapping distributions, so consider dropping it.

Step 3: Split Dataset into Train and Test

Split the data 70% train and 30% test. Pass stratify=y so both splits keep the same class proportions.

X = df[['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']]
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

You can also experiment with different feature subsets. For example, try using only the top 5 features from correlation.
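
A small helper keeps the split identical across feature subsets, so any difference in results comes from the features and not from a different shuffle. This is a sketch: split_feature_subset is a hypothetical name, and the demo DataFrame holds random values under a few of the dataset's column names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_feature_subset(df, features, target="target", test_size=0.3, seed=42):
    """Stratified train/test split restricted to the given feature columns."""
    X = df[list(features)]
    y = df[target]
    return train_test_split(X, y, test_size=test_size, random_state=seed, stratify=y)


# Hypothetical stand-in for heart.csv with random values.
rng = np.random.default_rng(0)
n = 100
demo = pd.DataFrame({
    "cp": rng.integers(0, 4, size=n),
    "thalach": rng.integers(71, 202, size=n),
    "oldpeak": rng.uniform(0, 6, size=n).round(1),
    "chol": rng.integers(126, 400, size=n),
    "target": np.tile([0, 1], n // 2),
})

X_tr, X_te, y_tr, y_te = split_feature_subset(demo, ["cp", "thalach", "oldpeak"])
print(X_tr.shape, X_te.shape)
print("Share of class 1 in train/test:", y_tr.mean(), y_te.mean())
```

Keeping random_state fixed means every subset is evaluated on the same test rows.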

Step 4: Implement Logistic Regression

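Use scikit-learn's LogisticRegression. Fit it on the training data and predict on the test set. max_iter is raised to 1000 because the default of 100 iterations may not be enough for the solver to converge on unscaled features.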

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
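
To see what the fitted model computes, you can reproduce its probabilities by hand: a weighted sum of the features passed through the sigmoid function. The sketch below uses synthetic data (X_demo, y_demo, and demo_model are illustrative names only) and checks the manual result against predict_proba for the binary case.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data, standing in for the heart features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo[:, 0] - 0.5 * X_demo[:, 1] + noise > 0).astype(int)

demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Linear score z = w . x + b, then sigmoid(z) = 1 / (1 + exp(-z)).
z = X_demo @ demo_model.coef_.ravel() + demo_model.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Column 1 of predict_proba is the probability of class 1.
p_sklearn = demo_model.predict_proba(X_demo)[:, 1]

print("Max difference:", np.abs(p_manual - p_sklearn).max())
print("Coefficients:", demo_model.coef_.ravel())
```

The predicted label is 1 whenever this probability is above 0.5, which is the same as the linear score z being positive. That is why the decision boundary is a straight line (or a plane) in feature space.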

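Optionally, standardize the features. It is not required, but it helps the solver converge and makes coefficient magnitudes comparable. Fit the scaler on the training split only, so no information from the test set leaks into training.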
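
One way to standardize without leaking test data is to wrap the scaler and the classifier in a scikit-learn pipeline, as sketched below. The data is synthetic, with one large-scale and one small-scale column loosely imitating chol and oldpeak, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales.
rng = np.random.default_rng(0)
n = 200
y_demo = rng.integers(0, 2, size=n)
X_demo = np.column_stack([
    250 + 40 * y_demo + rng.normal(scale=50, size=n),  # large scale, like chol
    1.0 + 1.0 * y_demo + rng.normal(scale=1.0, size=n),  # small scale, like oldpeak
])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training data only, then reuses
# those training means and standard deviations at prediction time.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

test_acc = scaled_model.score(X_te, y_te)
print(f"Test accuracy: {test_acc:.2f}")
print("Scaler means (from training split):", scaled_model.named_steps["standardscaler"].mean_)
```

Because the scaler lives inside the pipeline, calling predict or score on the test set applies the training statistics automatically.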

Step 5: Evaluate Performance

Compute accuracy, precision, recall, F1-score, and the confusion matrix.

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1-score: {f1:.2f}')
print('Confusion Matrix:')
print(cm)
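
It can help to see how the four metrics fall out of the confusion matrix. The sketch below uses a tiny hand-made set of labels (not real results) and unpacks the matrix in scikit-learn's order for binary labels: true negatives, false positives, false negatives, true positives.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hand-made labels for illustration: 4 low-risk and 6 high-risk cases.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

manual_precision = tp / (tp + fp)  # of predicted high risk, how many were right
manual_recall = tp / (tp + fn)     # of actual high risk, how many were caught
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Precision: {manual_precision:.2f}")
print(f"Recall: {manual_recall:.2f}")
print(f"Accuracy: {manual_accuracy:.2f}")
print(f"F1-score: {manual_f1:.2f}")
```

The manual values should match what precision_score, recall_score, and f1_score return for the same labels.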

Create an observation table as required. The numbers below are placeholders that show the layout; replace them with your own results:

| Features Used | Precision | Recall | Accuracy | F1-Score |
|---------------|-----------|--------|----------|----------|
| All 13        | 0.85      | 0.78   | 0.82     | 0.81     |
| Top 5         | 0.83      | 0.80   | 0.81     | 0.81     |

Try different feature sets and record the results. The assignment emphasizes experimentation, not just the highest accuracy.
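
Filling in the table by hand gets tedious once you try several subsets, so you might generate it in a loop. In this sketch, evaluate_feature_sets is a hypothetical helper and the columns f1, f2, and f3 are synthetic; with the real data you would pass df and lists of actual column names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def evaluate_feature_sets(df, feature_sets, target="target"):
    """Fit one logistic regression per feature set and return a results table."""
    rows = []
    y = df[target]
    for name, cols in feature_sets.items():
        X_tr, X_te, y_tr, y_te = train_test_split(
            df[cols], y, test_size=0.3, random_state=42, stratify=y
        )
        pred = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_te)
        rows.append({
            "Features Used": name,
            "Precision": precision_score(y_te, pred, zero_division=0),
            "Recall": recall_score(y_te, pred, zero_division=0),
            "Accuracy": accuracy_score(y_te, pred),
            "F1-Score": f1_score(y_te, pred, zero_division=0),
        })
    return pd.DataFrame(rows).round(2)


# Synthetic stand-in data: f1 and f2 carry signal, f3 is noise.
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "f1": target + rng.normal(scale=0.7, size=n),
    "f2": -target + rng.normal(scale=1.0, size=n),
    "f3": rng.normal(size=n),
    "target": target,
})

table = evaluate_feature_sets(demo, {"All 3": ["f1", "f2", "f3"], "Top 1": ["f1"]})
print(table.to_string(index=False))
```

The column order matches the observation table above, so the output can be copied into your report directly.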

Trend-Inspired Example: Like Training a Fitness AI

Think of this as building a simple AI that predicts heart risk, similar to how modern fitness apps (like Apple Health or Fitbit in 2026) use sensor data to alert users. Your logistic regression model is a mini AI that learns from historical data, just as a game AI learns from player moves. By visualizing features before modeling, you're practicing exploratory data analysis and feature selection, both key skills in machine learning.

Tips for Your Report

Include screenshots of your best plots.
Explain why you selected or eliminated each feature.
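Discuss trade-offs: whether high recall or high precision matters more depends on the use case. For risk screening, a missed high-risk patient (a false negative) is usually costlier than a false alarm.
Mention limitations: logistic regression assumes a linear decision boundary, while heart disease may follow nonlinear patterns.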
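
To illustrate the precision/recall trade-off in your report, you could vary the decision threshold instead of using the default 0.5. The sketch below uses synthetic data and a hypothetical threshold_sweep helper. Lowering the threshold flags more patients as high risk, which can only keep recall the same or raise it, usually at the cost of precision.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split


def threshold_sweep(y_true, proba, thresholds=(0.3, 0.5, 0.7)):
    """Return (threshold, precision, recall) for each decision threshold."""
    rows = []
    for t in thresholds:
        pred = (proba >= t).astype(int)
        rows.append((
            t,
            precision_score(y_true, pred, zero_division=0),
            recall_score(y_true, pred, zero_division=0),
        ))
    return rows


# Synthetic two-class data standing in for the heart features.
rng = np.random.default_rng(0)
n = 300
y_demo = rng.integers(0, 2, size=n)
X_demo = np.column_stack([
    y_demo + rng.normal(scale=1.0, size=n),
    rng.normal(size=n),
])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # probability of class 1 (high risk)

sweep = threshold_sweep(y_te, proba)
for t, p, r in sweep:
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

A short table like this, computed on your own model, makes the trade-off discussion concrete.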

Conclusion

By following this guide, you'll complete all requirements of CSCI 183 homework 3: extensive visualization, feature selection, model implementation, and evaluation. Remember, the goal is to experiment and learn. Even if your model isn't perfect, a thoroughly documented process is what the assignment emphasizes. Good luck!