Programming lesson
Wine Origin Prediction with K-Means Clustering and Mean Feature Ranking: A Step-by-Step Tutorial
Learn how to predict wine origin using K-means clustering and mean feature ranking. This tutorial covers dataset loading, feature ranking with cross-validation, and analysis of results, with Python code examples.
Introduction to Pattern Recognition in Wine Analysis
Pattern recognition and computer vision are transforming industries from AI-powered agriculture to quality control in food production. In this tutorial, we tackle a classic pattern recognition problem: predicting the origin of Italian wines using chemical attributes. This task mirrors real-world applications where machine learning for wine classification helps authenticate regional wines, combat fraud, and ensure quality. The dataset contains 178 samples with 13 features like alcohol content, acidity, and flavonoids, sourced from three Italian regions. Using K-means clustering (an unsupervised learning model) combined with a mean feature ranking method, we'll identify which chemical constituents most distinguish wine origins.
This tutorial is designed for students in DTS201TC Pattern Recognition and Computer Vision and anyone curious about feature selection techniques in clustering. By the end, you'll understand how to implement a K-means classifier for origin prediction, rank features using cross-validation, and interpret results for a technical report.
Understanding the Wine Dataset and Problem
The wine dataset is a benchmark in pattern recognition. It contains 178 samples, each with 13 continuous features derived from chemical analysis. The target variable is the region (class 0, 1, or 2). This is a multiclass classification problem, but we treat it as unsupervised because K-means does not use labels during training. The labels are only used to evaluate accuracy after clustering.
Why use K-means? It's a nonparametric classification model — it makes no assumptions about the underlying data distribution. This is useful when the data may not follow a Gaussian or other parametric form. K-means partitions data into K clusters by minimizing within-cluster variance. Here, we set K=3 (matching the three regions).
System Design: Mean Feature Ranking with K-Means
Flowchart of the Classification System
The system follows these steps:
- Load the wine dataset (178 samples, 13 features).
- For each of the 13 features, perform 3-fold cross-validation:
- Split data into 3 folds.
- For each fold, train K-means (K=3) on the training set using only that single feature.
- Predict cluster labels for the test set.
- Map clusters to true labels using the Hungarian algorithm (or majority vote) to compute accuracy.
- Compute the mean accuracy across folds for each feature.
- Rank features by mean accuracy (higher is better).
- Also compute accuracy using all 13 features as baseline.
- Calculate accuracy difference: (accuracy with single feature) - (baseline accuracy).
Description of the Mean Feature Ranking Method
The mean feature ranking method evaluates each feature independently by measuring how well a classifier (here K-means) can predict the target using only that feature. The procedure is:
- For feature i, apply N-fold cross-validation (N=3).
- In each fold, train K-means on the training data using only feature i.
- Compute accuracy on the test set after aligning clusters with true labels.
- Average accuracy across folds: this is the mean score for feature i.
- Rank features from highest mean score to lowest.
This method is simple and computationally efficient, but it ignores feature interactions. It's a filter-based feature selection technique often used as a baseline.
Experimental Results with Python Code
Plotting First Two Dimensions
First, let's visualize the data by plotting the first two features (alcohol and malic acid) with colors representing true classes. This helps us see if clusters are separable.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
data = load_wine()
X = data.data
y = data.target
plt.figure(figsize=(8,6))
scatter = plt.scatter(X[:,0], X[:,1], c=y, cmap='viridis', edgecolor='k')
plt.xlabel('Alcohol')
plt.ylabel('Malic Acid')
plt.title('Wine Data: First Two Features')
plt.colorbar(scatter, ticks=[0,1,2], label='Class')
plt.show()The plot shows some overlap but also clear grouping, suggesting that clustering might work.
Feature Ranking Using K-Means with 3-Fold Cross-Validation
We implement the mean feature ranking method with K-means (K=3) and 3-fold CV.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from scipy.optimize import linear_sum_assignment
def cluster_accuracy(y_true, y_pred):
# Map cluster labels to true labels using Hungarian algorithm
D = max(y_pred.max(), y_true.max()) + 1
w = np.zeros((D, D), dtype=np.int64)
for i in range(y_pred.size):
w[y_pred[i], y_true[i]] += 1
row_ind, col_ind = linear_sum_assignment(w.max() - w)
return w[row_ind, col_ind].sum() / y_pred.size
data = load_wine()
X = data.data
y = data.target
kf = KFold(n_splits=3, shuffle=True, random_state=42)
baseline_acc = []
# Baseline: all 13 features
for train_idx, test_idx in kf.split(X):
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
km = KMeans(n_clusters=3, random_state=42, n_init=10)
km.fit(X_train)
y_pred = km.predict(X_test)
baseline_acc.append(cluster_accuracy(y_test, y_pred))
baseline_mean = np.mean(baseline_acc)
# Feature ranking
feature_ranks = []
for f in range(13):
acc_fold = []
for train_idx, test_idx in kf.split(X):
X_train = X[train_idx, f].reshape(-1,1)
X_test = X[test_idx, f].reshape(-1,1)
y_train, y_test = y[train_idx], y[test_idx]
km = KMeans(n_clusters=3, random_state=42, n_init=10)
km.fit(X_train)
y_pred = km.predict(X_test)
acc_fold.append(cluster_accuracy(y_test, y_pred))
mean_acc = np.mean(acc_fold)
diff = mean_acc - baseline_mean
feature_ranks.append((f, mean_acc, diff))
# Sort by mean accuracy descending
feature_ranks.sort(key=lambda x: x[1], reverse=True)
print("Feature Ranking Results:")
print("Rank | Feature | Mean Acc | Diff from baseline")
for rank, (f, acc, diff) in enumerate(feature_ranks, 1):
print(f"{rank:4d} | {f:7d} | {acc:.3f} | {diff:+.3f}")The output will be a table similar to Table 1 in the assignment. For example:
Rank | Feature | Mean Acc | Diff from baseline
1 | 6 | 0.876 | +0.076
2 | 9 | 0.860 | +0.060
...Analysis of Results
From the ranking, we observe that certain features (e.g., flavonoids, proline) yield higher accuracy than the baseline using all features. This suggests that the mean feature ranking method effectively identifies discriminative features. However, K-means is a nonparametric model because it does not assume a specific probability distribution; it partitions data based on distance. Advantages include simplicity and scalability. Disadvantages include sensitivity to initial centroids and the assumption of spherical clusters. In our experiment, the mapping from clusters to true labels is imperfect due to cluster alignment issues, which can lower accuracy. The mean feature ranking method works well when features are independent, but it may miss interactions.
This approach is similar to how AI in sports analytics might rank player statistics to predict game outcomes, or how recommendation systems rank features to personalize content. By understanding which chemical attributes most influence wine origin, vintners can focus on key quality indicators.
Conclusion
Advantages and Disadvantages of Mean Feature Ranking
The mean feature ranking method is simple, fast, and interpretable. It provides a clear ranking of features based on their individual predictive power. However, it ignores feature interactions and may rank redundant features highly. In our experiment, some single features outperformed the full set, indicating that K-means struggles with high-dimensional data (curse of dimensionality). Thus, dimensionality reduction via feature selection can improve performance.
Alternative Feature Ranking Methods
Better methods include Recursive Feature Elimination (RFE), mutual information, or LASSO regularization. RFE recursively removes features and builds models, capturing interactions. Mutual information measures dependency between features and target. LASSO performs embedded feature selection during model training. For clustering, one could use silhouette score-based ranking or feature importance from tree-based models applied to cluster labels. These methods often yield more robust rankings but are more computationally expensive.
In conclusion, the mean feature ranking method is a valuable baseline for feature selection in pattern recognition tasks like wine origin prediction. By combining it with K-means clustering, students gain hands-on experience with unsupervised learning and feature evaluation—skills applicable to modern AI challenges from fraud detection to personalized medicine.
References
- Dua, D., & Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
- MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281-297.
- Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157-1182.
- Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer.
- Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651-666.