Implementing K-Means Clustering in Python

K-Means clustering is one of the most popular unsupervised learning algorithms used for partitioning a dataset into distinct clusters. It is simple, efficient, and widely used in various applications such as market segmentation, image compression, and pattern recognition. This blog post will provide a comprehensive guide to implementing K-Means clustering in Python.

What is K-Means Clustering?

K-Means clustering is an iterative algorithm that divides a dataset into K distinct, non-overlapping subgroups or clusters. Each cluster is characterized by its centroid, which is the mean of all points in the cluster. The algorithm aims to minimize the variance within each cluster, effectively grouping similar data points together.

Key Concepts of K-Means

  1. Centroids: The center of a cluster, calculated as the mean of all data points in the cluster.
  2. Inertia: The sum of squared distances between each data point and its nearest centroid. The algorithm aims to minimize this value.
  3. Iterations: The process of updating centroids and reassigning data points continues until convergence, typically when the centroids no longer change significantly.
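To make the first two concepts concrete, the following sketch recomputes the centroids and the inertia by hand and compares them with what scikit-learn reports. The synthetic dataset and the variable names `manual_centers` and `manual_inertia` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data; these parameters are illustrative assumptions
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Centroids: the mean of the points assigned to each cluster
manual_centers = np.array([X[kmeans.labels_ == j].mean(axis=0) for j in range(4)])

# Inertia: squared distance from every point to its own centroid, summed
diffs = X - kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = np.sum(diffs ** 2)

print("Largest centroid deviation:", np.abs(manual_centers - kmeans.cluster_centers_).max())
print("Inertia (manual): ", manual_inertia)
print("Inertia (sklearn):", kmeans.inertia_)
```

The centroid deviation should be close to zero, and the two inertia values should agree up to floating-point error.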

How K-Means Clustering Works

K-Means clustering follows a straightforward iterative process:

  1. Initialization: Select K initial centroids randomly. (scikit-learn defaults to the k-means++ scheme, which spreads the initial centroids apart.)
  2. Assignment: Assign each data point to the nearest centroid.
  3. Update: Recalculate the centroids based on the mean of the assigned data points.
  4. Repeat: Repeat the assignment and update steps until the centroids converge.
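The four steps above can be sketched directly in NumPy. This is a minimal illustration, not how scikit-learn implements the algorithm: the function name `kmeans_numpy`, its parameters, and the toy data are all assumptions, and initialization simply picks random data points.

```python
import numpy as np

def kmeans_numpy(X, k, n_iter=100, tol=1e-6, seed=0):
    """Minimal K-Means sketch; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assignment: squared Euclidean distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: mean of the assigned points (keep the old centroid if a cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until the centroids stop moving
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Two well-separated toy groups (made-up data)
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans_numpy(X_demo, k=2)
print(centroids)
print(labels)
```

On this toy data the first three points end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random initialization.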

Example of K-Means Algorithm

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic dataset
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Initialize KMeans (fixed seed and explicit n_init so the result is reproducible)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(X)

# Predict clusters
y_kmeans = kmeans.predict(X)

# Plot the clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75)
plt.show()

Choosing the Number of Clusters

Choosing the optimal number of clusters (K) is crucial for effective clustering. Several methods can help determine the appropriate value of K.

Elbow Method

The Elbow Method involves plotting the inertia against the number of clusters and identifying the “elbow point,” where the rate of decrease sharply slows down. This point suggests the optimal number of clusters.

inertias = []

for k in range(1, 11):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

Silhouette Score

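The Silhouette Score measures how similar a data point is to its own cluster compared to other clusters. It ranges from -1 to 1: values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest that points may have been assigned to the wrong cluster.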

from sklearn.metrics import silhouette_score

kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
score = silhouette_score(X, y_kmeans)
print("Silhouette Score:", score)
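The same score can be used to compare several candidate values of K, much like the Elbow Method. The sketch below assumes the synthetic blobs from earlier and an arbitrary search range of 2 to 8. The silhouette score is undefined for a single cluster, so the loop starts at 2.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data; these parameters are illustrative assumptions
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

scores = {}
for k in range(2, 9):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

# The k with the highest score is a candidate, not a guaranteed answer
best_k = max(scores, key=scores.get)
print("Best k by silhouette:", best_k)
```

Treat the winning K as a suggestion to check against the elbow plot and domain knowledge rather than as a definitive choice.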

Implementing K-Means Clustering in Python

Let’s dive deeper into implementing K-Means clustering in Python using the scikit-learn library. We will cover data preparation, model training, evaluation, and visualization.

Data Preparation

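Data preparation involves loading the dataset, handling missing values, and scaling the features. Scaling matters because K-Means relies on distances, so features with large numeric ranges would otherwise dominate the clustering. For this example, we will use the Iris dataset, which has no missing values.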

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load dataset
data = load_iris()
X = data.data

# Standardize the features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
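When a dataset does contain missing values, one common option is to impute them before scaling. The sketch below uses a made-up feature matrix and median imputation; both the data and the choice of strategy are assumptions, and other strategies may suit a given dataset better.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix with two missing entries
X_raw = np.array([[5.1, 3.5],
                  [4.9, np.nan],
                  [np.nan, 3.2],
                  [6.2, 2.9]])

# Fill each gap with its column median, then standardize
prep = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
X_prepared = prep.fit_transform(X_raw)

print("Any NaN left:", np.isnan(X_prepared).any())
print("Column means:", X_prepared.mean(axis=0))
print("Column stds: ", X_prepared.std(axis=0))
```

Putting both steps in one pipeline means the same imputation and scaling can later be applied to new data with `prep.transform`.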

Training the K-Means Model

Train the K-Means model by specifying the number of clusters and fitting it to the data.

kmeans = KMeans(n_clusters=3)
kmeans.fit(X_scaled)

# Predict clusters
y_kmeans = kmeans.predict(X_scaled)

Evaluating the Model

Evaluate the model using the inertia and silhouette score.

# Inertia
print("Inertia:", kmeans.inertia_)

# Silhouette Score
score = silhouette_score(X_scaled, y_kmeans)
print("Silhouette Score:", score)

Visualizing the Clusters

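Visualize the clusters and centroids using matplotlib. The Iris data has four features, so this plot shows only the first two (sepal length and sepal width). Clusters that overlap here may still be separated along the other two features.

import matplotlib.pyplot as plt

# Plot the first two standardized features, colored by cluster
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75)
plt.xlabel('Sepal length (standardized)')
plt.ylabel('Sepal width (standardized)')
plt.show()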

Advanced Topics in K-Means Clustering

Mini-Batch K-Means

Mini-Batch K-Means is a variant of K-Means that uses mini-batches to reduce computation time, making it suitable for large datasets.

from sklearn.cluster import MiniBatchKMeans

# Initialize MiniBatchKMeans
mbkmeans = MiniBatchKMeans(n_clusters=3, batch_size=100)
mbkmeans.fit(X_scaled)

# Predict clusters
y_mbkmeans = mbkmeans.predict(X_scaled)
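For data that arrives in pieces or does not fit in memory, `MiniBatchKMeans` also offers `partial_fit`, which updates the centroids one chunk at a time. The sketch below simulates this with a synthetic dataset split into ten chunks; the dataset size and chunk count are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for a dataset too large to process at once
X_big, _ = make_blobs(n_samples=10_000, centers=3, random_state=0)

mbk = MiniBatchKMeans(n_clusters=3, random_state=0)

# Feed the data in chunks of 1,000 rows, updating the centroids each time
for chunk in np.array_split(X_big, 10):
    mbk.partial_fit(chunk)

labels = mbk.predict(X_big)
print("Centroids shape:", mbk.cluster_centers_.shape)
print("Cluster sizes:  ", np.bincount(labels, minlength=3))
```

In a real pipeline, the loop would read chunks from disk or a stream instead of splitting an in-memory array.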

Handling Outliers

Outliers can significantly impact the performance of K-Means clustering. Identifying and removing outliers before clustering can improve results.

from sklearn.neighbors import LocalOutlierFactor

# Identify outliers: fit_predict returns 1 for inliers and -1 for outliers
lof = LocalOutlierFactor()
outliers = lof.fit_predict(X_scaled)

# Keep only the inliers
X_cleaned = X_scaled[outliers == 1]
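To see why this matters, the sketch below builds a toy dataset with two tight groups and one extreme point, then clusters it before and after filtering. The data, the outlier location, and `n_neighbors=20` are all assumptions chosen to make the effect easy to see.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Two tight made-up groups plus one extreme point
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[6.0, 0.0], scale=0.5, size=(50, 2))
outlier = np.array([[100.0, 100.0]])
X_toy = np.vstack([group_a, group_b, outlier])

# With the outlier present, one of the two clusters is spent on it
labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_toy)
print("Cluster sizes with outlier:   ", np.bincount(labels_raw))

# fit_predict returns 1 for inliers and -1 for outliers
inlier_mask = LocalOutlierFactor(n_neighbors=20).fit_predict(X_toy) == 1
X_clean = X_toy[inlier_mask]

labels_clean = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_clean)
print("Cluster sizes without outlier:", np.bincount(labels_clean))
```

Before filtering, the extreme point claims a cluster of its own and the two real groups are merged. After filtering, the two groups are recovered. LocalOutlierFactor may also flag a few borderline points, so the cleaned dataset can be slightly smaller than 100 rows.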

Using Different Distance Metrics

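scikit-learn's KMeans supports only Euclidean distance, and subclassing it does not change that because the class exposes no hook for swapping the metric. The restriction is also mathematical: the mean minimizes squared Euclidean distance, so mean-based updates paired with another metric are no longer guaranteed to converge. For Manhattan distance, the matching algorithm is k-medians, which replaces the mean with the per-feature median. A minimal hand-written sketch, assuming the scaled Iris data from above:

from sklearn.metrics.pairwise import manhattan_distances

def k_medians(X, n_clusters, n_iter=100, seed=0):
    # Hand-written sketch, not a scikit-learn API
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center under Manhattan distance
        labels = manhattan_distances(X, centers).argmin(axis=1)
        # Update each center to the per-feature median of its points
        new_centers = np.array([
            np.median(X[labels == j], axis=0) if np.any(labels == j) else centers[j]
            for j in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Cluster the scaled Iris data with k-medians
labels_l1, centers_l1 = k_medians(X_scaled, n_clusters=3)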

Practical Applications of K-Means Clustering

Market Segmentation

K-Means clustering is widely used in market segmentation to identify distinct customer groups based on purchasing behavior, demographics, and other factors.

# Example: Customer data
customers = np.array([
    [23, 50000],
    [25, 60000],
    [31, 120000],
    [35, 150000],
    [40, 200000],
    [60, 45000],
    [70, 70000]
])

# Normalize the data
scaler = StandardScaler()
customers_scaled = scaler.fit_transform(customers)

# Cluster the data
kmeans = KMeans(n_clusters=3)
kmeans.fit(customers_scaled)

# Predict clusters
customer_segments = kmeans.predict(customers_scaled)
print(customer_segments)
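Cluster labels alone are hard to interpret, so it often helps to map the centroids back to the original units. The sketch below does this with the scaler's `inverse_transform`. It assumes the two columns are age and annual income, which the example data does not state explicitly.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Assumed columns: age (years), annual income
customers = np.array([
    [23, 50000],
    [25, 60000],
    [31, 120000],
    [35, 150000],
    [40, 200000],
    [60, 45000],
    [70, 70000],
], dtype=float)

scaler = StandardScaler()
customers_scaled = scaler.fit_transform(customers)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers_scaled)

# Map centroids from standardized space back to the original units
centroids_original = scaler.inverse_transform(kmeans.cluster_centers_)
for segment, (age, income) in enumerate(centroids_original):
    size = np.sum(kmeans.labels_ == segment)
    print(f"Segment {segment}: {size} customers, mean age {age:.1f}, mean income {income:,.0f}")
```

Each printed row describes a segment's typical customer in readable terms, which is usually what a marketing team needs.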

Image Compression

K-Means can be used for image compression by reducing the number of colors in an image. Each pixel is assigned to the nearest cluster centroid, which represents a color.

from sklearn.utils import shuffle
import matplotlib.image as mpimg

# Load image
image = mpimg.imread('image.jpg')
image = np.array(image, dtype=np.float64) / 255

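# Flatten the image into a list of pixels (rows are height, columns are width)
h, w, d = image.shape
image_array = np.reshape(image, (h * w, d))

# Fit KMeans on a random sample of pixels to keep training fast
image_sample = shuffle(image_array, random_state=0, n_samples=1000)
kmeans = KMeans(n_clusters=64, random_state=0)
kmeans.fit(image_sample)

# Replace each pixel with the color of its nearest centroid
labels = kmeans.predict(image_array)
compressed_image = kmeans.cluster_centers_[labels].reshape(h, w, d)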

# Display compressed image
plt.imshow(compressed_image)
plt.show()
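The array displayed above still stores three values per pixel, so the saving only appears once each pixel is stored as an index into the 64-color palette. The following back-of-the-envelope sketch estimates that saving relative to uncompressed 24-bit RGB. The helper `palette_compression_ratio` and the 427 x 640 image size are hypothetical.

```python
import math

def palette_compression_ratio(height, width, n_colors, bits_per_channel=8, channels=3):
    """Raw size divided by (per-pixel palette indices + palette table), in bits."""
    n_pixels = height * width
    raw_bits = n_pixels * channels * bits_per_channel
    # Each pixel stores only the index of its color in the palette
    index_bits = n_pixels * math.ceil(math.log2(n_colors))
    # The palette itself stores one full color per cluster centroid
    palette_bits = n_colors * channels * bits_per_channel
    return raw_bits / (index_bits + palette_bits)

# Hypothetical 427 x 640 image reduced to 64 colors
ratio = palette_compression_ratio(height=427, width=640, n_colors=64)
print(f"Approximate compression ratio: {ratio:.2f}")
```

With 64 colors, each pixel needs 6 bits instead of 24, so the ratio comes out just under 4 once the small palette table is included.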

Document Clustering

K-Means is used in document clustering to group similar documents together based on their content, which is useful in information retrieval and text mining.

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "This is a document about machine learning.",
    "Another document discussing artificial intelligence.",
    "Text data can be clustered using KMeans.",
    "KMeans clustering is popular in data science.",
    "Machine learning and data science are closely related."
]

# Convert documents to TF-IDF matrix
vectorizer = TfidfVectorizer(stop_words='english')
X_tfidf = vectorizer.fit_transform(documents)

# Cluster documents
kmeans = KMeans(n_clusters=2)
kmeans.fit(X_tfidf)

# Predict clusters
document_clusters = kmeans.predict(X_tfidf)
print(document_clusters)

Conclusion

K-Means clustering is a versatile and widely used algorithm in machine learning for partitioning datasets into clusters. By understanding the principles behind K-Means, choosing the appropriate number of clusters, and implementing the algorithm in Python, you can effectively apply clustering to various practical problems. From market segmentation to image compression and document clustering, K-Means offers a simple and efficient way to uncover hidden patterns in data, provided the features are scaled and outliers are handled first.
