Cancer Tumor Classification with Naive Bayes: CSCI 184 Homework Guide

Introduction: Why Naive Bayes for Cancer Classification?

In machine learning, Naive Bayes classifiers play the role of a reliable utility player: they are rarely the flashiest model, but they train quickly and often provide a strong baseline, even when the data has many features. For your CSCI 184 homework, you'll work with a cancer dataset of 569 rows and 32 columns, where the goal is to predict whether a tumor is malignant or benign. This tutorial walks you through each step, from loading the data to evaluating your model, using a Naive Bayes approach. By the end, you'll have a working classifier and a clear understanding of why Naive Bayes is a common baseline for diagnostic classification tasks.

Step 1: Load the Dataset and Explore Its Structure

First, load the 'cancer.csv' file into a pandas DataFrame. Use pd.read_csv() and immediately print the DataFrame and its shape. This helps you verify that all 569 rows and 32 columns are present. Remember, the target variable is 'diagnosis', which indicates whether the tumor is malignant (M) or benign (B).

import pandas as pd
df = pd.read_csv('cancer.csv')
print(df)
print(df.shape)
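
As a quick sanity check after loading, you can also unpack the shape and count missing values per column. The sketch below uses a tiny in-memory CSV as a stand-in for cancer.csv (the three rows and their values are made up for illustration); with the real file you would expect the shape (569, 32).

```python
import io
import pandas as pd

# Made-up stand-in for cancer.csv so the example runs on its own.
csv_text = """id,diagnosis,radius_mean,texture_mean
1,M,17.99,10.38
2,B,13.54,14.36
3,B,12.45,
"""
df_demo = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = df_demo.shape             # (rows, columns)
missing_per_column = df_demo.isna().sum()  # number of NaN values in each column

print(n_rows, n_cols)
print(missing_per_column)
```

A nonzero count in missing_per_column is a signal to inspect that column before training.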

Step 2: Inspect Column Names and Data Types

Next, print the column names and their data types using df.columns and df.dtypes. This step is crucial because it reveals which columns are numeric (for example float64) and which hold strings (dtype object). Gaussian Naive Bayes needs numeric features, so confirm that every column you plan to use as a feature is numeric. The 'diagnosis' column will be your target after encoding.

print(df.columns)
print(df.dtypes)
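
To list only the columns that still need attention before modeling, one option is select_dtypes. A minimal sketch on a made-up three-column frame (the values are illustrative, not taken from cancer.csv):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, 12.45],
    "texture_mean": [10.38, 14.36, 15.70],
})

# Columns that are not numeric must be encoded or dropped before training.
non_numeric = df_demo.select_dtypes(exclude="number").columns.tolist()
numeric = df_demo.select_dtypes(include="number").columns.tolist()

print(non_numeric)
print(numeric)
```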

Step 3: Visualize Feature Separability

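Plot Radius Mean against Texture Mean (the 'radius_mean' and 'texture_mean' columns) and color the points by diagnosis. Use a scatter plot with matplotlib or seaborn. This visualization helps you judge whether a straight line could separate the two classes in these two features. If the classes overlap heavily, any linear boundary drawn on these two features alone will misclassify some points. Gaussian Naive Bayes does not depend on a clean linear boundary: it fits a per-class distribution to each feature and predicts the class with the higher posterior probability, and it can draw on all of the features rather than just these two.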

import matplotlib.pyplot as plt
import seaborn as sns
sns.scatterplot(data=df, x='radius_mean', y='texture_mean', hue='diagnosis')
plt.title('Radius Mean vs Texture Mean by Diagnosis')
plt.show()

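Is the data linearly separable? Probably not perfectly: expect two visible clusters with some overlap between them. Partial overlap like this is a reason to use a probabilistic classifier such as Gaussian Naive Bayes, which estimates class probabilities instead of requiring a clean separating line.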
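
A numeric complement to the plot is to compare the per-class mean and standard deviation of the two features: class means that are far apart relative to the spread suggest the classes are easier to tell apart. A sketch on made-up values (not from cancer.csv):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "diagnosis": ["M", "M", "M", "B", "B", "B"],
    "radius_mean": [18.0, 20.0, 19.0, 12.0, 13.0, 11.0],
    "texture_mean": [21.0, 23.0, 22.0, 15.0, 19.0, 17.0],
})

# One row per class, with the mean and sample standard deviation of each feature.
summary = df_demo.groupby("diagnosis")[["radius_mean", "texture_mean"]].agg(["mean", "std"])
print(summary)
```

These per-class means and spreads are also the quantities a Gaussian model estimates for each feature.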

Step 4: Encode the Target Variable

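Since 'diagnosis' contains the strings 'M' and 'B', you need to convert them to numbers. Label encoding works fine here. Use sklearn.preprocessing.LabelEncoder or a simple dictionary mapping. Note that LabelEncoder assigns codes in sorted order of the labels, so 'B' becomes 0 and 'M' becomes 1; with a dictionary you choose the mapping yourself.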

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['diagnosis'] = le.fit_transform(df['diagnosis'])
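
The dictionary-mapping alternative mentioned above makes the encoding explicit. A minimal sketch, using a made-up Series in place of df['diagnosis']:

```python
import pandas as pd

diagnosis = pd.Series(["M", "B", "B", "M"])  # stand-in for df['diagnosis']

mapping = {"B": 0, "M": 1}  # benign -> 0, malignant -> 1
encoded = diagnosis.map(mapping)

# Any label missing from the mapping would become NaN, so check for that.
assert encoded.notna().all()
print(encoded.tolist())
```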

Step 5: Split into Features (X) and Target (y)

Define X as all columns except 'diagnosis' (and the 'id' column, if present), and y as the encoded 'diagnosis'. Drop any other non-numeric or irrelevant columns.

# errors='ignore' avoids a KeyError if the file has no 'id' column
X = df.drop(columns=['diagnosis', 'id'], errors='ignore')
y = df['diagnosis']
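
If you are unsure which columns are present, a defensive version of this step drops the known non-feature columns and then keeps only numeric ones. The frame below is a made-up stand-in; the 'note' column is hypothetical and exists only to show a non-numeric column being filtered out.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": [1, 0, 0],                # already encoded
    "radius_mean": [17.99, 13.54, 12.45],
    "note": ["a", "b", "c"],               # hypothetical non-numeric column
})

y_demo = df_demo["diagnosis"]
X_demo = (
    df_demo.drop(columns=["diagnosis", "id"], errors="ignore")  # remove target and identifier
    .select_dtypes(include="number")                            # keep numeric features only
)
print(X_demo.columns.tolist())
```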

Step 6: Train-Test Split (70-30)

Use train_test_split from sklearn to split the data. A 70-30 split is common: 70% training, 30% testing. Set random_state for reproducibility.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
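
One optional refinement, not required by the steps above, is to pass stratify=y so that both splits keep the same class ratio as the full dataset. A sketch on synthetic data (60 samples of one class and 40 of the other, all made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)  # 100 synthetic samples, 2 features
y_demo = np.array([0] * 60 + [1] * 40)   # 60/40 class balance

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())  # fraction of class 1 in each split
```

Without stratify, the class ratio in a small test set can drift from the ratio in the full data, which makes per-class metrics noisier.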

Step 7: Choose the Right Naive Bayes Variant

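Because all features are continuous (e.g., radius mean, texture mean, area), Gaussian Naive Bayes is the most suitable variant. It assumes that, within each class, every feature follows a normal (Gaussian) distribution and that the features are independent of one another given the class. Neither assumption holds exactly for these measurements (radius and area are closely related, for instance), but the model often performs well anyway. Multinomial Naive Bayes is designed for count data and Bernoulli Naive Bayes for binary features, so neither fits here.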

from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
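
To make the idea behind Gaussian Naive Bayes concrete, here is a simplified from-scratch sketch. It is not scikit-learn's implementation: it estimates a prior, a mean, and a variance per class and feature, then picks the class with the highest log posterior. The function names and toy numbers are illustrative assumptions, and the fixed 1e-9 variance floor is a simplification of scikit-learn's var_smoothing parameter.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate (prior, feature means, feature variances) for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    params = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Population variance plus a small floor to avoid division by zero.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        params[label] = (n / len(X), means, variances)
    return params

def log_posterior(x, prior, means, variances):
    """Unnormalized log posterior: log prior plus a sum of Gaussian log densities."""
    total = math.log(prior)
    for v, m, var in zip(x, means, variances):
        total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
    return total

def predict(params, x):
    """Return the class label with the highest log posterior for sample x."""
    return max(params, key=lambda label: log_posterior(x, *params[label]))

# Toy two-feature measurements (made-up numbers, not from cancer.csv).
X_toy = [[12.0, 15.0], [11.5, 14.0], [13.0, 16.5],
         [19.0, 22.0], [20.5, 24.0], [18.0, 21.0]]
y_toy = [0, 0, 0, 1, 1, 1]

params = fit_gaussian_nb(X_toy, y_toy)
pred_small = predict(params, [12.5, 15.5])
pred_large = predict(params, [19.5, 23.0])
print(pred_small, pred_large)
```

Summing per-feature log densities is where the "naive" independence assumption enters: each feature contributes on its own, with no term for how features vary together.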

Step 8: Evaluate Model Performance

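After training, predict on the test set and print a performance matrix. Use confusion_matrix and classification_report from sklearn.metrics. The confusion matrix counts correct and incorrect predictions for each class, and the classification report gives precision, recall, F1-score, and accuracy.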

from sklearn.metrics import classification_report, confusion_matrix
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
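
To connect the two outputs, the metrics in the classification report can be recomputed by hand from the confusion matrix. The counts below are hypothetical, chosen only to illustrate the arithmetic; they are not results from this dataset. The sketch assumes the encoding 0 = benign and 1 = malignant, and scikit-learn's layout in which rows are true classes and columns are predicted classes.

```python
# Hypothetical confusion matrix for labels [0, 1]: [[TN, FP], [FN, TP]].
cm = [[100, 5],
      [4, 62]]
tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)  # of tumors predicted malignant, the share that were malignant
recall = tp / (tp + fn)     # of truly malignant tumors, the share the model caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```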

Step 9: Report Your Findings

Write a brief report summarizing your results. Include screenshots of your code outputs, the scatter plot, and the performance matrix. Discuss why Gaussian Naive Bayes was chosen and comment on the model's accuracy. For example, if your classification report shows high precision and recall for both classes, you can note that the classifier performs reliably on this test split; if one class scores noticeably lower, say which one and suggest why.

Step 10: Submit Your Work

Submit your completed Jupyter notebook (.ipynb) and a PDF of your report. Ensure your code is well-commented and your report includes all required elements. Good luck!

Why This Matters Beyond Homework

Naive Bayes classifiers are widely used in real-world applications such as spam detection, sentiment analysis, and medical diagnosis. They are simple, fast to train, and easy to interpret, which makes them an effective baseline against which more complex models can be compared. Understanding Naive Bayes will serve you well in advanced machine learning courses and in the growing field of AI-driven healthcare.

Remember, the key to mastering Naive Bayes is practice. This homework gives you hands-on experience with a real dataset, which is useful preparation whether you're aiming for a role in data science, AI, or software engineering. Happy coding!