Unsupervised learning is a type of machine learning where the algorithm learns from unlabelled data — there are no predefined correct answers. The goal is to discover hidden patterns, structures, or groupings in the data without human guidance on what the output should be.
| Aspect | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Data | Labelled (X, y) | Unlabelled (X only) |
| Goal | Predict known outputs | Discover hidden structure |
| Evaluation | Compare predictions to known labels | Domain knowledge, visual inspection, internal metrics |
| Examples | Spam detection, price prediction | Customer segmentation, anomaly detection |
Clustering groups similar data points together based on their features. Points within the same cluster are more similar to each other than to points in other clusters.
K-Means is the most popular clustering algorithm. It partitions data into k clusters by iteratively assigning points to the nearest cluster centre and updating the centres.
How K-Means works:

1. Choose k initial cluster centres (centroids).
2. Assign each point to its nearest centroid.
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2–3 until the assignments stop changing (or a maximum number of iterations is reached).
```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit K-Means
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
kmeans.fit(X)

# Plot results
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis', s=30)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='X', s=200, label='Centroids')
plt.legend()
plt.title('K-Means Clustering')
plt.show()
```
A common way to choose k is the elbow method: run K-Means for a range of k values and plot the inertia (the sum of squared distances from each point to its cluster centre):

```python
inertias = []
K_range = range(1, 11)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
```
The "elbow" in the plot — where adding more clusters stops significantly reducing inertia — suggests the optimal k.
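When no clear elbow appears, an internal metric such as the silhouette score (one of the evaluation approaches mentioned earlier) can back up the choice of k. A minimal sketch using scikit-learn's `silhouette_score` on the same `make_blobs` data as above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Silhouette score ranges from -1 to 1; higher means better-separated
# clusters. It needs at least 2 clusters, so start the range at k=2.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")
```

Unlike inertia, which always decreases as k grows, the silhouette score peaks at a specific k, so it can be maximised directly.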
| Algorithm | Strengths | Weaknesses |
|---|---|---|
| K-Means | Fast, scalable, simple | Requires choosing k, assumes spherical clusters |
| DBSCAN | Finds arbitrarily shaped clusters, detects outliers | Sensitive to density parameters |
| Hierarchical | Produces a dendrogram, no need to prespecify k | Slow on large datasets |
| Gaussian Mixture | Soft clustering (probabilities), flexible shapes | Assumes Gaussian distributions |
| Mean Shift | No need to specify k, finds arbitrary shapes | Computationally expensive |
```python
from sklearn.cluster import DBSCAN

# DBSCAN does not require specifying k
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)

# Label -1 indicates noise / outlier points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters found: {n_clusters}, Noise points: {n_noise}")
```
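The table above lists Gaussian Mixtures as a soft-clustering option: instead of a hard label, each point gets a probability of belonging to each component. A minimal sketch on the same `make_blobs` data (using 4 components to mirror the earlier K-Means example; that count is an assumption, not something GMM infers):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Soft clustering: predict_proba returns a probability per component,
# so each row of `probs` sums to 1
gmm = GaussianMixture(n_components=4, random_state=42)
gmm.fit(X)
probs = gmm.predict_proba(X)  # shape (300, 4)

print(probs[0].round(3))
```

`gmm.predict(X)` still yields hard labels when needed; the probabilities are useful for flagging points that sit between clusters.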
Dimensionality reduction transforms high-dimensional data into a lower-dimensional representation while preserving as much meaningful structure as possible.
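PCA (principal component analysis) is the classic example. A minimal sketch, assuming scikit-learn's `PCA` applied to the 4-feature Iris dataset (the dataset choice is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 150 samples, 4 features

# Project the 4-D data onto the 2 directions of greatest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)  # (150, 2)
# Fraction of the original variance captured by each component
print(pca.explained_variance_ratio_)
```

The 2-D projection can then be plotted or fed into a clustering algorithm, with `explained_variance_ratio_` indicating how much structure the reduction preserved.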