Unsupervised Learning on MNIST: PCA & K-Means Tutorial for MATH38161

Introduction to Unsupervised Learning with MNIST

Finding patterns in data without labels is one of the most useful skills in machine learning. The MNIST database (Modified National Institute of Standards and Technology) is a classic benchmark for it. For this tutorial, we focus on a subset of 3000 handwritten digits, specifically the digits 5, 6, and 7, each represented as a 28×28 grayscale image (784 pixels). The goal is to apply unsupervised learning techniques to discover natural groupings and reduce dimensionality. The same ideas underpin real applications such as photo clustering and anomaly detection.

Loading and Exploring the Data

Your dataset is a 3000×785 matrix. The first column is the true label (digit), and columns 2–785 are pixel intensities (0–255). In R, you can read it with:

data <- as.matrix(read.table("digit.txt"))
labels <- data[, 1]
pixels <- data[, -1]

Let's quickly check the distribution:

table(labels)

You might see roughly 1000 of each digit. This balance is important for clustering evaluation later.

Principal Component Analysis (PCA) for Dimensionality Reduction

With 784 dimensions, the data cannot be plotted directly. PCA finds a small number of orthogonal directions (principal components) that retain as much of the variance as possible. It's like summarizing a viral TikTok trend into a few key themes: you capture the essence without every detail.

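Run PCA on the pixel data. The call below centers and scales each pixel, and first drops pixels that are constant across all images (typically blank borders), because prcomp stops with an error when asked to scale a zero-variance column:

# Keep only pixels that vary across images
keep <- apply(pixels, 2, var) > 0
pca_result <- prcomp(pixels[, keep], center = TRUE, scale. = TRUE)
summary(pca_result)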
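
The variance figures reported by summary() can also be computed directly from the component standard deviations, which is handy when deciding how many components to keep. A minimal sketch on simulated data (toy, toy_pca, and n_pc_90 are illustrative names, and the 90% cutoff is an arbitrary assumption, not a course requirement):

```r
# Simulated stand-in for the pixel matrix (200 rows, 10 columns); not the real digits
set.seed(1)
toy <- matrix(rnorm(200 * 10), nrow = 200) %*% diag(10:1)
toy_pca <- prcomp(toy, center = TRUE, scale. = FALSE)

# Proportion of variance explained (PVE) by each component
pve <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
cum_pve <- cumsum(pve)

# Smallest number of components whose cumulative PVE reaches 90%
n_pc_90 <- which(cum_pve >= 0.90)[1]

round(cum_pve[1:3], 3)
n_pc_90
```

On the real data, the same three lines applied to pca_result$sdev give the cumulative proportion for any number of components.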

The first two principal components capture only part of the variance, perhaps around 30%; read the exact cumulative proportion off the summary output. Plot the scores:

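plot(pca_result$x[, 1], pca_result$x[, 2], col = as.integer(factor(labels)), pch = 19, xlab = "PC1", ylab = "PC2", main = "PCA of MNIST Digits 5,6,7")
legend("topright", legend = levels(factor(labels)), col = 1:3, pch = 19)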

You'll likely see three overlapping groups, and digits 5 and 6 may look more separable than 5 and 7. The labels are used only to color the points, not to compute the projection, so the plot shows that the data has structure even without labels.

K-Means Clustering: Grouping Digits Without Labels

Now, let's apply k-means clustering to partition the 3000 images into 3 groups (k=3). This is analogous to how a music streaming service groups songs into playlists based on audio features.

set.seed(123)
kmeans_result <- kmeans(pixels, centers = 3, nstart = 25)
table(kmeans_result$cluster, labels)

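Check the confusion matrix. Cluster numbers are arbitrary, so match each cluster to the digit that dominates it before reading off an accuracy. You might get ~80% accuracy, which is not perfect but impressive given that no label information was used. Misassignments often occur between similar-looking digits (e.g., a poorly written 5 might look like a 6).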
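
Because k-means numbers its clusters arbitrarily, an accuracy figure only makes sense after each cluster is mapped to a digit. One simple way, sketched here on made-up values, is to assign each cluster its majority digit and report the purity (true_labels, cluster_ids, and purity are illustrative names, not part of the dataset):

```r
# Toy stand-in: true digits and arbitrary cluster ids (illustrative values only)
true_labels <- c(5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7)
cluster_ids <- c(2, 2, 2, 3, 3, 3, 3, 3, 1, 1, 1, 2)

confusion <- table(cluster = cluster_ids, digit = true_labels)

# For each cluster (row), pick the digit that occurs most often in it
majority_digit <- colnames(confusion)[apply(confusion, 1, which.max)]

# Purity: fraction of points that carry their cluster's majority digit
purity <- sum(apply(confusion, 1, max)) / sum(confusion)

majority_digit
purity
```

On the real data, replace the toy vectors with labels and kmeans_result$cluster. Note that purity equals accuracy only when every cluster maps to a different digit; if two clusters share a majority digit, one digit has no cluster of its own.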

To improve, consider using PCA scores as input to k-means (denoising):

kmeans_pca <- kmeans(pca_result$x[,1:50], centers = 3, nstart = 25)
table(kmeans_pca$cluster, labels)

Using 50 PCs often yields similar or better clustering because it removes noise.

Evaluating Cluster Quality with Silhouette Score

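For each point, the silhouette width compares its average distance to the points in its own cluster with its average distance to the points in the nearest other cluster. It ranges from -1 to 1, and values near 1 indicate well-separated clusters. Compute it on the PCA-reduced data: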

library(cluster)
sil <- silhouette(kmeans_pca$cluster, dist(pca_result$x[,1:50]))
mean(sil[,3])

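As a rough rule of thumb, an average above about 0.3 suggests reasonable structure. The value you get on these digits depends on how many PCs you keep and can be noticeably lower, so use it mainly to compare settings rather than as a pass mark.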
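
To see what the score is actually measuring, the silhouette width can be worked out by hand as s = (b - a) / max(a, b), where a is a point's mean distance to its own cluster and b is its mean distance to the nearest other cluster. A small base-R sketch on four made-up points (sil_width is a hypothetical helper, not a function from the cluster package):

```r
# Hand-rolled silhouette widths; singleton clusters get 0 by convention
sil_width <- function(d, cl) {
  sapply(seq_along(cl), function(i) {
    own <- cl == cl[i]
    own[i] <- FALSE                 # exclude the point itself
    if (!any(own)) return(0)
    other <- cl != cl[i]
    a <- mean(d[i, own])            # mean distance within own cluster
    b <- min(tapply(d[i, other], cl[other], mean))  # nearest other cluster
    (b - a) / max(a, b)
  })
}

x <- c(0, 1, 10, 11)                # two tight, well-separated groups
cl <- c(1, 1, 2, 2)
s <- sil_width(as.matrix(dist(x)), cl)
round(s, 3)
mean(s)
```

The two groups are far apart relative to their spread, so every width is close to 1.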

Visualizing Cluster Centers

Each cluster center from k-means is a 784-dimensional vector. Reshape it to 28×28 and display as an image to see the "average digit" per cluster:

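par(mfrow = c(1, 3))
for (i in 1:3) {
  # Rebuild the 28x28 image, assuming pixels are stored row by row
  center <- matrix(kmeans_result$centers[i, ], nrow = 28, byrow = TRUE)
  # image() maps matrix rows to the x-axis and columns to the y-axis (bottom to top),
  # so reverse the rows and transpose to show the digit upright
  image(t(center[28:1, ]), col = gray(0:255 / 255), axes = FALSE, main = paste("Cluster", i))
}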

You'll see that the cluster centers resemble blurred versions of 5, 6, and 7. The blur is expected: each center is simply the pixel-wise average of all images assigned to that cluster.
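
The "average digit" reading can be checked directly: after k-means converges, each center should equal the column means of the points assigned to it. A quick sketch on simulated two-dimensional data (x, km, and manual are illustrative names):

```r
set.seed(1)
# Two simulated groups of 50 points each; not the digit data
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 5)

# Recompute each center as the mean of its assigned points
manual <- t(sapply(1:2, function(k) colMeans(x[km$cluster == k, , drop = FALSE])))

# Largest absolute difference between the two versions (expected to be ~0)
max(abs(km$centers - manual))
```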

Tuning Parameters: Elbow Method for k

How many clusters should we choose? Use the elbow method on the within-cluster sum of squares:

wss <- sapply(1:10, function(k) kmeans(pixels, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total WSS")

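If the curve bends visibly around k=3, that supports our choice. On image data the elbow is often gradual, so treat it as one piece of evidence rather than proof. The elbow method is standard practice in customer segmentation for e-commerce apps.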

Hierarchical Clustering: An Alternative Approach

For smaller datasets (e.g., a random subset of 300 points), hierarchical clustering can reveal relationships. Use Ward's method:

set.seed(42)
idx <- sample(1:3000, 300)
hc <- hclust(dist(pixels[idx,]), method = "ward.D2")
plot(hc, labels = labels[idx], main = "Hierarchical Clustering Dendrogram")

The dendrogram should show three main branches, again supporting the digit grouping, although individual leaves may sit in the wrong branch.
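
Reading branches off a dendrogram by eye is subjective, so it can help to cut the tree into a fixed number of groups with cutree() and tabulate them against the labels. A minimal sketch on simulated blobs (toy, toy_labels, and toy_hc are illustrative stand-ins for pixels[idx,], labels[idx], and hc):

```r
set.seed(42)
# Three well-separated Gaussian blobs in 5 dimensions (simulated, not digit data)
toy <- rbind(
  matrix(rnorm(30 * 5, mean = 0), ncol = 5),
  matrix(rnorm(30 * 5, mean = 10), ncol = 5),
  matrix(rnorm(30 * 5, mean = 20), ncol = 5)
)
toy_labels <- rep(c(5, 6, 7), each = 30)

toy_hc <- hclust(dist(toy), method = "ward.D2")

# Cut the tree into 3 groups and compare with the known labels
groups <- cutree(toy_hc, k = 3)
tab <- table(group = groups, digit = toy_labels)
tab
```

On the digit subset the table will be less clean than on these blobs, since the real classes overlap.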

Connection to Real-World AI Trends

Unsupervised learning on MNIST is a microcosm of how AI systems organize data. For instance, self-supervised learning (used in large language models like GPT) relies on similar principles: find structure without explicit labels. In 2026, many AI apps use clustering to personalize content, detect fraud, or even analyze sports plays (e.g., clustering player movements into patterns).

Tip: In your MATH38161 assignment, you might also explore t-SNE or UMAP for visualization. These are popular in bioinformatics and image analysis.

Common Pitfalls and How to Avoid Them

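Scaling: Scaling each pixel to unit variance stops high-variance pixels from dominating the PCA, but prcomp cannot scale constant columns (such as always-blank border pixels), so drop those first. Because all pixels already share the same 0-255 range, centering without scaling is also a reasonable choice.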
Random seeds: K-means is sensitive to initialization. Set a seed and run multiple starts (nstart).
Curse of dimensionality: With 784 dimensions, distances become less meaningful because the nearest and farthest points end up almost equally far away. Reducing dimension first with PCA or feature selection helps.
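
The curse-of-dimensionality point can be made concrete with a small simulation: as the dimension grows, the gap between the nearest and farthest neighbour shrinks relative to the nearest distance. A sketch using uniform random points (relative_contrast is a hypothetical helper, and uniform data is an assumption chosen for simplicity, not a model of the digits):

```r
set.seed(1)
# Relative contrast: (farthest - nearest) / nearest distance from one query point
relative_contrast <- function(dim, n = 200) {
  pts <- matrix(runif(n * dim), nrow = n)
  query <- runif(dim)
  d <- sqrt(rowSums(sweep(pts, 2, query)^2))  # Euclidean distances to the query
  (max(d) - min(d)) / min(d)
}

contrast <- sapply(c(2, 50, 784), relative_contrast)
names(contrast) <- c("dim2", "dim50", "dim784")
contrast
```

A small contrast in 784 dimensions means distance-based methods such as k-means have little to work with, which is one motivation for clustering on a few dozen PCs instead.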

Conclusion

You've now applied PCA, k-means, and hierarchical clustering to the MNIST digits 5,6,7. These techniques are foundational for any data scientist working with high-dimensional data. The skills you develop here transfer directly to analyzing financial data, sensor data, or even social media trends. Remember: unsupervised learning is about discovery — let the data speak.

For your MATH38161 assignment, consider comparing algorithms, evaluating with metrics like adjusted Rand index, and discussing the limitations of each method. Good luck!
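
The adjusted Rand index compares two partitions without needing to match cluster numbers to digits, and it is corrected for chance agreement. If you prefer not to rely on an extra package, it can be computed from the contingency table in base R. A minimal sketch (adjusted_rand_index is a hypothetical helper written for this tutorial; it does not handle the degenerate case where both partitions have a single group):

```r
# Adjusted Rand index from the contingency table of two partitions
adjusted_rand_index <- function(x, y) {
  tab <- table(x, y)
  n <- sum(tab)
  sum_ij <- sum(choose(as.vector(tab), 2))   # pairs together in both partitions
  sum_a <- sum(choose(rowSums(tab), 2))      # pairs together in x
  sum_b <- sum(choose(colSums(tab), 2))      # pairs together in y
  expected <- sum_a * sum_b / choose(n, 2)   # agreement expected by chance
  max_index <- (sum_a + sum_b) / 2
  (sum_ij - expected) / (max_index - expected)
}

# Same grouping with different cluster numbers gives 1
ari_same <- adjusted_rand_index(c(1, 1, 2, 2, 3, 3), c(3, 3, 1, 1, 2, 2))
# A grouping that cuts across the other one scores below 0
ari_cross <- adjusted_rand_index(c(1, 1, 2, 2), c(1, 2, 1, 2))
c(ari_same, ari_cross)
```

On the real data, call it as adjusted_rand_index(kmeans_result$cluster, labels) and again with kmeans_pca$cluster to compare the two clusterings.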