Clustering Techniques
Clustering is an unsupervised learning technique used to group similar data points into meaningful
clusters without any prior labels. It helps in discovering hidden patterns or natural groupings in data.
For example, grouping customers by purchasing behavior, or organizing documents by topic. In clustering,
the goal is to ensure that data points within the same cluster are more similar to each other than to
those in other clusters. There are various clustering techniques like K-Means, Hierarchical Clustering,
and DBSCAN — each with its own approach to forming these groups.
K-Means Clustering
K-Means is one of the most popular unsupervised learning algorithms used to group similar data
points into clusters. It is mainly used in clustering problems where we don't have labeled data. The
goal is to divide the data into K groups (clusters) based on similarity.
Key Concepts
- Centroid: The center of a cluster.
- K: Number of clusters we want to form.
- Euclidean Distance: The distance between a data point and a centroid.
How K-Means Works (Incremental Method)
- Step 1: Choose the number of clusters K.
- Step 2: Randomly initialize K centroids.
- Step 3: Go through each data point one by one:
- Assign it to the nearest centroid (based on distance).
- Immediately update the centroid of that cluster using the mean of its current members.
- Step 4: Repeat this for all points. Optionally make additional passes over the data until the centroids stop changing.
Distance Formula (1D)
When using just one feature like Shoe Size, distance between two points is the absolute difference:
\( d = |x_2 - x_1| \)
Example Dataset
Consider the following dataset of clients and their shoe sizes:
| Client  | Shoe Size |
|---------|-----------|
| Alice   | 6         |
| Bob     | 9         |
| Charlie | 7         |
| Diana   | 5         |
| Edward  | 8         |
| Fiona   | 10        |
| George  | 11        |
K-Means (K = 2)
- Step 1: Choose K = 2 (we want 2 clusters).
- Step 2: Randomly select initial centroids:
- Centroid A = Alice = 6
- Centroid B = Bob = 9
Step-by-Step Assignment & Updating
- Point Charlie = 7:
- Distance to A = |7 - 6| = 1
- Distance to B = |7 - 9| = 2
- → Assign to Cluster A
- Update Centroid A = Average of (6, 7) = 6.5
- Point Diana = 5:
- Distance to A = |5 - 6.5| = 1.5
- Distance to B = |5 - 9| = 4
- → Assign to Cluster A
- Update Centroid A = Average of (6, 7, 5) = 6
- Point Edward = 8:
- Distance to A = |8 - 6| = 2
- Distance to B = |8 - 9| = 1
- → Assign to Cluster B
- Update Centroid B = Average of (9, 8) = 8.5
- Point Fiona = 10:
- Distance to A = |10 - 6| = 4
- Distance to B = |10 - 8.5| = 1.5
- → Assign to Cluster B
- Update Centroid B = Average of (9, 8, 10) = 9
- Point George = 11:
- Distance to A = |11 - 6| = 5
- Distance to B = |11 - 9| = 2
- → Assign to Cluster B
- Update Centroid B = Average of (9, 8, 10, 11) = 9.5
Final Clusters (After All Points)
- Cluster A (Centroid ≈ 6): Alice, Charlie, Diana
- Cluster B (Centroid ≈ 9.5): Bob, Edward, Fiona, George
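To make the procedure concrete, here is a minimal sketch of this incremental 1D K-Means in plain Python (no libraries needed). The variable names are illustrative choices, but the loop reproduces the assignments and centroid updates shown above.

```python
# Incremental 1-D K-Means on the shoe-size data above.
# Centroids are seeded with Alice (6) and Bob (9); each remaining point is
# assigned to the nearest centroid, which is then updated immediately.

points = {"Charlie": 7, "Diana": 5, "Edward": 8, "Fiona": 10, "George": 11}
clusters = {"A": [6], "B": [9]}          # seeded with Alice and Bob
centroids = {"A": 6.0, "B": 9.0}

for name, x in points.items():
    # nearest centroid by absolute (1-D) distance
    nearest = min(centroids, key=lambda c: abs(x - centroids[c]))
    clusters[nearest].append(x)
    # immediately recompute the centroid as the mean of the cluster's members
    centroids[nearest] = sum(clusters[nearest]) / len(clusters[nearest])
    print(f"{name} -> Cluster {nearest}, centroid now {centroids[nearest]:.2f}")

print(clusters)    # A: [6, 7, 5]   B: [9, 8, 10, 11]
print(centroids)   # A: 6.0         B: 9.5
```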
Euclidean Distance Formula
This formula is used to measure how far two points are from each other in a 2D space (like age and
amount):
\( d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \)
Example Dataset
Let’s consider the following dataset:
| Client | Age | Amount |
|--------|-----|--------|
| C1     | 20  | 500    |
| C2     | 40  | 1000   |
| C3     | 30  | 800    |
| C4     | 18  | 300    |
| C5     | 28  | 1200   |
| C6     | 35  | 1400   |
| C7     | 45  | 1800   |
K-Means (K = 2)
- Step 1: Choose K = 2 (we want 2 clusters).
- Step 2: Randomly select 2 initial centroids:
- Centroid A = C1 = (20, 500)
- Centroid B = C2 = (40, 1000)
Step-by-Step Assignment & Updating
- Point C3 = (30, 800):
- Distance to A = √((30 - 20)² + (800 - 500)²) = √(100 + 90000) = √90100 ≈ 300.17
- Distance to B = √((30 - 40)² + (800 - 1000)²) = √(100 + 40000) = √40100 ≈ 200.25
- → Assign to Cluster B
- Update Centroid B = Average of (40,1000) and (30,800) = (35, 900)
- Point C4 = (18, 300):
- Distance to A = √((18 - 20)² + (300 - 500)²) = √(4 + 40000) = √40004 ≈ 200.01
- Distance to B = √((18 - 35)² + (300 - 900)²) = √(289 + 360000) = √360289 ≈ 600.24
- → Assign to Cluster A
- Update Centroid A = Average of (20,500) and (18,300) = (19, 400)
- Point C5 = (28, 1200):
- Distance to A = √((28 - 19)² + (1200 - 400)²) = √(81 + 640000) = √640081 ≈ 800.05
- Distance to B = √((28 - 35)² + (1200 - 900)²) = √(49 + 90000) = √90049 ≈ 300.08
- → Assign to Cluster B
- Update Centroid B = Avg of (40,1000), (30,800), (28,1200) = (32.67, 1000)
- Point C6 = (35, 1400):
- Distance to A = √((35 - 19)² + (1400 - 400)²) = √(256 + 1000000) = √1000256 ≈ 1000.13
- Distance to B = √((35 - 32.67)² + (1400 - 1000)²) = √(5.45 + 160000) = √160005.45 ≈ 400.01
- → Assign to Cluster B
- Update Centroid B = Avg of C2, C3, C5, C6 = ((40+30+28+35)/4, (1000+800+1200+1400)/4) = (133/4, 4400/4)
- → New Centroid B = (33.25, 1100)
- Point C7 = (45, 1800):
- Distance to A = √((45 - 19)² + (1800 - 400)²) = √(676 + 1960000) = √1960676 ≈ 1400.24
- Distance to B = √((45 - 33.25)² + (1800 - 1100)²) = √(138.06 + 490000) = √490138.06 ≈ 700.10
- → Assign to Cluster B
- Update Centroid B = Avg of C2, C3, C5, C6, C7 = ((40+30+28+35+45)/5, (1000+800+1200+1400+1800)/5) = (178/5, 6200/5)
- → Final Centroid B = (35.6, 1240)
Final Clusters (After All Points)
- Cluster A (Centroid ≈ 19, 400): C1, C4
- Cluster B (Centroid ≈ 35.6, 1240): C2, C3, C5, C6, C7
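The same incremental procedure works in 2D; below is a minimal sketch (plain Python plus the standard math module) that reproduces the Age/Amount walk-through. Note that scikit-learn's KMeans implements the batch (Lloyd's) algorithm rather than this one-pass incremental variant, so its intermediate steps would differ.

```python
import math

# Incremental 2-D K-Means on the (Age, Amount) data above,
# seeded with C1 = (20, 500) and C2 = (40, 1000).

points = {"C3": (30, 800), "C4": (18, 300), "C5": (28, 1200),
          "C6": (35, 1400), "C7": (45, 1800)}
clusters = {"A": [(20, 500)], "B": [(40, 1000)]}
centroids = {"A": (20.0, 500.0), "B": (40.0, 1000.0)}

def mean_point(members):
    return (sum(p[0] for p in members) / len(members),
            sum(p[1] for p in members) / len(members))

for name, p in points.items():
    # nearest centroid by Euclidean distance
    nearest = min(centroids, key=lambda c: math.dist(p, centroids[c]))
    clusters[nearest].append(p)
    centroids[nearest] = mean_point(clusters[nearest])
    print(name, "->", nearest, "centroid now", centroids[nearest])

print(centroids)   # A: (19.0, 400.0)   B: (35.6, 1240.0)
```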
Silhouette Score (Clustering Evaluation)
After performing clustering (like K-Means), it's important to check how good the clustering
actually is.
One of the most popular ways to evaluate clustering quality is using the Silhouette
Score.
What is Silhouette Score?
Silhouette Score tells us how well a data point fits into its own cluster and how far it is from
the other clusters.
It gives a score between -1 and 1 for each point:
- +1: Perfectly assigned to its cluster.
- 0: On the boundary between two clusters.
- Negative: Possibly assigned to the wrong cluster.
Formula
For a single data point \(i\), let:
\[
a(i) = \text{average distance from } i \text{ to all other points in the same cluster}
\]
\[
b(i) = \text{average distance from } i \text{ to all points in the nearest other cluster}
\]
Then:
\[
S(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}
\]
Steps to Calculate Silhouette Score
- Step 1: Perform clustering (e.g., K-Means).
- Step 2: For each point:
- Find a(i) = avg. distance to other points in same cluster.
- Find b(i) = avg. distance to points in the nearest other cluster.
- Calculate silhouette score for that point.
- Step 3: Take the average of all S(i) to get the overall Silhouette Score.
Simple Example
Assume we have 4 points clustered into 2 clusters:
- Cluster A: P1 = (1, 2), P2 = (2, 2)
- Cluster B: P3 = (8, 8), P4 = (9, 9)
Let’s calculate Silhouette Score for P1:
- a(P1): Distance to P2
a = √((2-1)² + (2-2)²) = √(1) = 1
- b(P1): Avg distance to P3 and P4
d1 = √((8-1)² + (8-2)²) = √(49 + 36) = √85 ≈ 9.22
d2 = √((9-1)² + (9-2)²) = √(64 + 49) = √113 ≈ 10.63
b = (9.22 + 10.63) / 2 ≈ 9.93
- Silhouette Score for P1:
S = (b - a) / max(a, b) = (9.93 - 1) / 9.93 ≈ 0.899
Interpretation
Since P1’s score is close to 1, it means it's nicely separated and properly clustered.
If most points have high scores, your clustering is good. If many have low or negative scores,
your clusters may overlap or be wrongly assigned.
Advantages of Silhouette Score
- Easy to understand and visualize.
- Works without ground truth labels.
- Can help decide the optimal number of clusters (K).
Limitations
- Computationally expensive on large datasets.
- Assumes convex cluster shapes (works best with spherical clusters).
Note:
You can also use scikit-learn's sklearn.metrics.silhouette_score() in Python to calculate this automatically.
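Following that note, here is a minimal sketch that checks the hand calculation for the four-point example with scikit-learn (assuming NumPy and scikit-learn are installed):

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# The four points from the example, already assigned to two clusters.
X = np.array([[1, 2], [2, 2], [8, 8], [9, 9]])
labels = np.array([0, 0, 1, 1])

print(silhouette_samples(X, labels))  # per-point S(i); P1 comes out ≈ 0.90
print(silhouette_score(X, labels))    # average silhouette over all points
```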
Hierarchical Clustering
Hierarchical clustering builds a hierarchy of clusters. Unlike K-Means, it does not require
choosing the number of clusters (K) beforehand. There are two main approaches:
Types of Hierarchical Clustering
- Agglomerative (Bottom-Up): Start with each point as its own cluster, and keep
merging the closest pairs until one cluster remains.
- Divisive (Top-Down): Start with all points in one cluster, and keep splitting
it into smaller clusters.
Distance Measurement (Linkage Methods)
- Single Linkage: Distance between the closest pair of points in two clusters.
- Complete Linkage: Distance between the farthest pair of points in two clusters.
- Average Linkage: Average of all pairwise distances between points in two
clusters.
Example: Agglomerative Clustering
Given 1D data points: [1, 5, 8, 10, 19, 20]
Step-by-step using single linkage:
- Start with each point as its own cluster:
[1], [5], [8], [10], [19], [20]
- Find the closest pair (minimum distance):
Distance(19, 20) = 1 → Merge → [19, 20]
- Next closest:
Distance(8, 10) = 2 → Merge → [8, 10]
- Next closest:
Distance(5, 8) = 3 → Merge → [5, 8, 10]
- Next closest:
Distance(1, 5) = 4 → Merge → [1, 5, 8, 10]
- Final merge:
Distance(10, 19) = 9 → Merge all → [1, 5, 8, 10, 19, 20]
Dendrogram
A dendrogram is a tree-like diagram that shows the sequence of merges (or splits).
You can “cut” the dendrogram at any level to get the desired number of clusters.
Divisive Clustering (Top-Down)
This method starts with all points in one big cluster and splits them recursively.
It's less common and computationally heavier than agglomerative.
Steps:
- Start with all points in one cluster: [1, 5, 8, 10, 19, 20]
- Split into two clusters based on large distance gap → [1, 5, 8, 10] and [19, 20]
- Keep splitting each sub-cluster until each point is separate.
Advantages
- No need to pre-define K (number of clusters).
- Produces a full cluster hierarchy.
- Works well with small datasets and dendrograms help visualize structure.
Limitations
- Computationally expensive for large datasets.
- Not suitable for streaming or dynamic data.
- Once a merge/split happens, it can't be undone.
Note:
Use libraries like scipy.cluster.hierarchy in Python to create dendrograms and perform clustering automatically.
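As that note suggests, the agglomerative example can be reproduced with SciPy. This is a minimal sketch, assuming SciPy and NumPy are available:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# The 1-D points from the example, shaped as (n_samples, n_features).
X = np.array([1, 5, 8, 10, 19, 20], dtype=float).reshape(-1, 1)

# Agglomerative clustering with single linkage (closest-pair distance).
Z = linkage(X, method="single")
print(Z)  # each row: the two clusters merged and the distance at which they merge

# "Cut" the hierarchy into 2 flat clusters -> [1, 5, 8, 10] vs [19, 20]
print(fcluster(Z, t=2, criterion="maxclust"))

# dendrogram(Z) draws the merge tree when matplotlib is available.
```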
DBSCAN Clustering Algorithm
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm that groups together points that are closely packed, marking points that lie alone in low-density regions as noise. It is especially useful when clusters have irregular shapes.
Key Concepts
- Epsilon (ε): The maximum radius of the neighborhood around a point. Think of it as how far you look around a person to find friends.
- MinPts: Minimum number of points required to form a dense region (cluster), like the minimum number of people needed in a group to call it a crowd.
- Core Point: A point that has at least MinPts neighbors within ε radius.
- Border Point: A point that has fewer than MinPts neighbors but lies within the neighborhood of a core point.
- Noise Point: A point that is neither core nor border (isolated points).
Real-Life Analogy: Mall Supervisor
Imagine you are a supervisor in a mall watching the crowd. You define your neighborhood radius (ε) as how far you look around a person to see if they are in a crowd. You decide that if a group has at least 4 people (MinPts), you call it a crowd (cluster). If someone is with enough people nearby, they are a core point. If they are near a crowd but don't have enough people themselves, they are a border point. People wandering alone far from any crowd are noise.
Example Dataset
| Point | X   | Y   |
|-------|-----|-----|
| P1    | 4.5 | 8   |
| P2    | 5   | 7   |
| P3    | 6   | 6.5 |
| P4    | 7   | 5   |
| P5    | 9   | 4   |
| P6    | 7   | 3   |
| P7    | 8   | 3.5 |
| P8    | 9   | 5   |
| P9    | 4   | 4   |
| P10   | 3   | 7.5 |
| P11   | 4   | 6   |
| P12   | 3.5 | 5   |
Parameters
- Epsilon (ε): 1.9 (radius to search neighbors)
- MinPts: 4 (minimum points to form a cluster)
Euclidean Distance Formula
To find distance between two points \(P_i = (x_i, y_i)\) and \(P_j = (x_j, y_j)\):
\( d = \sqrt{(x_j - x_i)^2 + (y_j - y_i)^2} \)
Distance Table (All Pairwise Distances)
Calculated distances between points (rounded to 2 decimals):
|     | P1   | P2   | P3   | P4   | P5   | P6   | P7   | P8   | P9   | P10  | P11  | P12  |
|-----|------|------|------|------|------|------|------|------|------|------|------|------|
| P1  | 0.00 | 1.12 | 2.12 | 3.91 | 6.02 | 5.59 | 5.70 | 5.41 | 4.03 | 1.58 | 2.06 | 3.16 |
| P2  | 1.12 | 0.00 | 1.12 | 2.83 | 5.00 | 4.47 | 4.61 | 4.47 | 3.16 | 2.06 | 1.41 | 2.50 |
| P3  | 2.12 | 1.12 | 0.00 | 1.80 | 3.91 | 3.64 | 3.61 | 3.35 | 3.20 | 3.16 | 2.06 | 2.92 |
| P4  | 3.91 | 2.83 | 1.80 | 0.00 | 2.24 | 2.00 | 1.80 | 2.00 | 3.16 | 4.72 | 3.16 | 3.50 |
| P5  | 6.02 | 5.00 | 3.91 | 2.24 | 0.00 | 2.24 | 1.12 | 1.00 | 5.00 | 6.95 | 5.39 | 5.59 |
| P6  | 5.59 | 4.47 | 3.64 | 2.00 | 2.24 | 0.00 | 1.12 | 2.83 | 3.16 | 6.02 | 4.24 | 4.03 |
| P7  | 5.70 | 4.61 | 3.61 | 1.80 | 1.12 | 1.12 | 0.00 | 1.80 | 4.03 | 6.40 | 4.72 | 4.74 |
| P8  | 5.41 | 4.47 | 3.35 | 2.00 | 1.00 | 2.83 | 1.80 | 0.00 | 5.10 | 6.50 | 5.10 | 5.50 |
| P9  | 4.03 | 3.16 | 3.20 | 3.16 | 5.00 | 3.16 | 4.03 | 5.10 | 0.00 | 3.64 | 2.00 | 1.12 |
| P10 | 1.58 | 2.06 | 3.16 | 4.72 | 6.95 | 6.02 | 6.40 | 6.50 | 3.64 | 0.00 | 1.80 | 2.55 |
| P11 | 2.06 | 1.41 | 2.06 | 3.16 | 5.39 | 4.24 | 4.72 | 5.10 | 2.00 | 1.80 | 0.00 | 1.12 |
| P12 | 3.16 | 2.50 | 2.92 | 3.50 | 5.59 | 4.03 | 4.74 | 5.50 | 1.12 | 2.55 | 1.12 | 0.00 |
Step 1: Identify Neighbors of Each Point (Within ε = 1.9)
Check which points are within distance 1.9 for each point:
- P1: P2 (1.12), P10 (1.58) → 2 neighbors + P1 = 3 total
- P2: P1 (1.12), P3 (1.12), P11 (1.41) → 3 neighbors + P2 = 4 total
- P3: P2 (1.12), P4 (1.80) → 2 neighbors + P3 = 3 total
- P4: P3 (1.80), P7 (1.80) → 2 neighbors + P4 = 3 total
- P5: P7 (1.12), P8 (1.00) → 2 neighbors + P5 = 3 total
- P6: P7 (1.12) → 1 neighbor + P6 = 2 total
- P7: P4 (1.80), P5 (1.12), P6 (1.12), P8 (1.80) → 4 neighbors + P7 = 5 total
- P8: P5 (1.00), P7 (1.80) → 2 neighbors + P8 = 3 total
- P9: P12 (1.12) → 1 neighbor + P9 = 2 total
- P10: P1 (1.58), P11 (1.80) → 2 neighbors + P10 = 3 total
- P11: P2 (1.41), P10 (1.80), P12 (1.12) → 3 neighbors + P11 = 4 total
- P12: P9 (1.12), P11 (1.12) → 2 neighbors + P12 = 3 total
Step 2: Identify Core, Border, and Noise Points
- Core Points (≥ MinPts = 4 total points including itself):
- P2: Has 4 total points (P1, P3, P11, and itself) → Core
- P7: Has 5 total points (P4, P5, P6, P8, and itself) → Core
- P11: Has 4 total points (P2, P10, P12, and itself) → Core
- Border Points (fewer than 4 total points, but within ε of a core point):
- P1: Connected to core point P2
- P3: Connected to core point P2
- P4: Connected to core point P7
- P5: Connected to core point P7
- P6: Connected to core point P7
- P8: Connected to core point P7
- P10: Connected to core point P11
- P12: Connected to core point P11
- Noise Points (not core and not connected to any core point):
- P9: Only connected to P12, but not within ε of any core point
Step 3: Form Clusters
Since core points P2 and P11 are connected through their mutual neighborhood (distance 1.41 < 1.9),
they form one large cluster:
- Cluster 1: P1, P2, P3, P10, P11, P12 (6 points)
- Core points: P2, P11
- Border points: P1, P3, P10, P12
- Cluster 2: P4, P5, P6, P7, P8 (5 points)
- Core point: P7
- Border points: P4, P5, P6, P8
Summary
- Cluster 1: P1, P2, P3, P10, P11, P12 (6 points)
- Cluster 2: P4, P5, P6, P7, P8 (5 points)
- Noise Points: P9 (1 point)
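The same result can be checked with scikit-learn's DBSCAN. A minimal sketch, assuming NumPy and scikit-learn are installed (scikit-learn counts the point itself in min_samples, matching the "4 total points including itself" rule used above):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Points P1..P12 from the example dataset.
X = np.array([[4.5, 8], [5, 7], [6, 6.5], [7, 5], [9, 4], [7, 3],
              [8, 3.5], [9, 5], [4, 4], [3, 7.5], [4, 6], [3.5, 5]])

db = DBSCAN(eps=1.9, min_samples=4).fit(X)

print(db.labels_)               # cluster id per point; -1 marks noise (P9)
print(db.core_sample_indices_)  # indices of the core points (P2, P7, P11)
```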
Why Is DBSCAN Useful?
- It can find clusters of arbitrary shapes.
- It automatically identifies noise points.
- It does not require specifying the number of clusters upfront.
Dimensionality Reduction Techniques
When working with datasets that have a lot of features (or columns), it becomes tricky to visualize,
analyze, and build models effectively. That’s where dimensionality reduction techniques come in. In this
session, we'll explore what high-dimensional data is, why it's a problem, and how we can handle it using
methods like Principal Component Analysis (PCA).
Understanding the Curse of Dimensionality
The more features (also called dimensions) your dataset has, the more complex things get. This
complexity leads to a challenge often referred to as the "Curse of Dimensionality", or
more casually, the "Problem of Plenty".
What Does "Problem of Plenty" Mean?
It describes the difficulties that arise when datasets have too many dimensions. While more data sounds
good, too many features can actually make analysis harder instead of easier.
What Are Dimensions in Data?
In data science, each feature or column in your dataset is considered a dimension. Here's how dimensions
look visually:
- 1 Dimension: A straight line
- 2 Dimensions: A flat plane (like an x-y graph)
- 3 Dimensions: A cube or 3D space
- 4 or More Dimensions: Cannot be visualized easily — they exist only mathematically
Why Are Too Many Dimensions a Problem?
There are two main reasons:
1. Difficult Data Representation
- Visualizing data with many dimensions is nearly impossible
- Human intuition doesn’t work well beyond 3 dimensions
2. Poor Model Performance
High-dimensional data often causes a big difference between how well a model performs on the training
set and how poorly it performs on new, unseen data. This is known as the training vs. testing
performance gap.
- Models may "memorize" the training data
- But they fail to generalize to new data
- This issue is a direct result of the curse of dimensionality
Putting It All Together
All these issues — difficulty in visualizing, analyzing, and modeling — come together to form what we
call the "Curse of Dimensionality" or "Problem of Plenty".
Dimensionality Reduction to the Rescue: PCA
To deal with high-dimensional data, one of the most popular and effective techniques is
Principal Component Analysis (PCA).
What is PCA?
PCA is a method that helps us reduce the number of dimensions in our dataset while keeping as much
important information as possible. It transforms the original features into a smaller set of new
features called principal components.
Why Use PCA?
When working with data that has many features (like 50 or more), we often want to know:
- Which dimensions are actually useful for analysis?
- Which ones contribute the most to variation in the data?
- Which dimensions can be removed without losing important information?
How PCA Helps
- Dimension Selection: PCA identifies the features that carry the most meaningful
information
- Dimension Reduction: PCA creates new features (principal components) that summarize
the original data
- Information Preservation: It reduces the number of features while keeping the core
structure of the data
Important Concept: It's Not Just About Deleting Features
Reducing dimensions doesn’t mean randomly removing columns. Proper dimensionality reduction involves:
- Understanding which features are most important
- Keeping or transforming those important features
- Making sure we don’t lose key patterns in the process
- Preserving the quality and meaning of the original dataset
PCA's Role Summarized
- Finds which dimensions contribute most to data variation
- Creates new dimensions (principal components) that carry essential information
- Reduces overall dimensionality while keeping the important structure intact
Low Variance Filter
The Low Variance Filter is a simple and effective dimensionality reduction technique used during
data preprocessing. It removes features (columns) from the dataset that show very little variation
across observations.
These features don’t contribute much in distinguishing data points and can be safely removed.
Key Concept
- Variance: A measure of how much the values of a feature differ from the mean.
Low variance means most values are similar.
- Low Variance Filter: If the variance of a feature is lower than a threshold
(e.g. 0.01), that feature is removed.
Why Use It?
- Speeds up training by reducing number of features.
- Eliminates features that carry almost no information.
- Reduces risk of overfitting by removing irrelevant data.
How It Works
- Compute variance of each feature.
- Compare variance with a defined threshold (e.g., 0.01).
- Drop features whose variance is below the threshold.
Example Dataset
Suppose we have the following dataset:
| Feature A | Feature B | Feature C |
|-----------|-----------|-----------|
| 1         | 7         | 0         |
| 1         | 9         | 0         |
| 1         | 5         | 0         |
How to Calculate Variance
Variance measures how spread out the values are. Here's the formula:
\( \text{Variance} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 \)
where \( \mu \) is the mean of the values.
Feature B Values: 7, 9, 5
- Step 1: Mean (μ) = (7 + 9 + 5) / 3 = 7.0
- Step 2: Squared Differences from Mean:
- (7 - 7)² = 0
- (9 - 7)² = 4
- (5 - 7)² = 4
- Step 3: Variance = (0 + 4 + 4) / 3 = 2.67
Feature A: All values = 1 → Variance = 0
Feature C: All values = 0 → Variance = 0
Filter Step
If threshold = 0.1, drop features with variance < 0.1 → Drop Feature A and Feature C.
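The same filter step can be done with scikit-learn's VarianceThreshold. A minimal sketch on the toy dataset above, using the 0.1 threshold from the filter step:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Columns: Feature A, Feature B, Feature C (from the example dataset).
X = np.array([[1, 7, 0],
              [1, 9, 0],
              [1, 5, 0]])

# keep only features whose variance is above 0.1
selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)

print(selector.variances_)     # approx. [0.  2.67  0.]
print(selector.get_support())  # [False  True  False] -> only Feature B survives
print(X_reduced)               # the surviving column
```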
When to Use
- Before training any machine learning model.
- During feature selection pipeline.
- When dataset has many irrelevant or constant columns.
Limitations
- Doesn’t consider feature importance or target variable.
- Only removes features with low variance, not redundant ones with high variance.
- May miss removing uninformative features that still vary.
Note:
A good practice is to combine Low Variance Filter with other feature selection methods like
Correlation Filter, Backward Elimination, or PCA for better results.
High Correlation Filter
The High Correlation Filter is a feature selection technique used to eliminate redundant features
that are strongly correlated with each other. When two or more features have a high correlation,
they carry similar information. Keeping all of them leads to redundancy and may cause overfitting.
Key Concept
- Correlation: A statistical measure that expresses the extent to which two
features change together.
- Correlation Coefficient (r): Value between -1 and +1 indicating direction and
strength of correlation.
- +1 → Perfect positive correlation
- -1 → Perfect negative correlation
- 0 → No correlation
- Threshold: If |correlation| between two features > 0.9 (or chosen
threshold), remove one of them.
Why Use It?
- Removes redundant features with similar behavior.
- Reduces multicollinearity (problem for linear models).
- Improves model generalization and performance.
How It Works
- Compute correlation matrix of all features.
- Identify feature pairs with correlation above a chosen threshold (e.g., 0.9).
- Drop one feature from each highly correlated pair.
Example
Suppose we have the following dataset:
| Feature X | Feature Y | Feature Z |
|-----------|-----------|-----------|
| 1         | 2         | 5         |
| 2         | 4         | 6         |
| 3         | 6         | 7         |
| 4         | 8         | 8         |
| 5         | 10        | 9         |
Correlation matrix of the features:
|   | X    | Y    | Z    |
|---|------|------|------|
| X | 1.00 | 1.00 | 1.00 |
| Y | 1.00 | 1.00 | 1.00 |
| Z | 1.00 | 1.00 | 1.00 |
As we can see, the correlation between Feature X and Feature Y is 1.00, meaning they are perfectly positively correlated (Y = 2 × X). In this toy dataset, Feature Z (Z = X + 4) is also perfectly correlated with X and Y, so keeping just one of the three features is enough to remove the redundancy.
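Here is a minimal sketch of this filter with pandas, using the toy dataset above; the threshold value and the pair-scanning loop are illustrative choices rather than a fixed API:

```python
import pandas as pd

# Toy dataset from the example: Y = 2*X and Z = X + 4.
df = pd.DataFrame({"X": [1, 2, 3, 4, 5],
                   "Y": [2, 4, 6, 8, 10],
                   "Z": [5, 6, 7, 8, 9]})

corr = df.corr().abs()   # absolute correlation matrix
threshold = 0.9

# Inspect each pair once (upper triangle) and drop the later feature
# of any pair whose correlation exceeds the threshold.
to_drop = set()
cols = list(corr.columns)
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if cols[j] not in to_drop and corr.iloc[i, j] > threshold:
            to_drop.add(cols[j])

print(to_drop)                         # {'Y', 'Z'} in this toy case -> keep only X
print(df.drop(columns=list(to_drop)))
```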
When to Use
- When dataset has many numeric features.
- During preprocessing before model training.
- Especially useful in linear models (like regression) to avoid multicollinearity.
Limitations
- Only works on numeric features.
- Does not consider relationship with target variable.
- Sometimes both correlated features may still be important for prediction (use with care).
Note:
High Correlation Filter is usually used after removing constant/low variance features and before
applying more complex techniques like PCA or Recursive Feature Elimination.
Backward Feature Elimination
Backward Feature Elimination is a wrapper-based feature selection technique where we start with all features and remove the least useful ones one by one. The goal is to find the smallest subset of features that still gives the best model performance.
How It Works
- Start with all features in the dataset.
- Train the model and evaluate performance (e.g., using accuracy, R², etc.).
- Remove the feature that has the least impact on model performance (least significant).
- Repeat until removing more features worsens performance or desired number of features is
reached.
Why Use It?
- Improves model performance by removing noise.
- Helps reduce overfitting.
- Makes the model simpler and faster to train.
- Especially useful when you have a lot of features.
Example (Simple)
Suppose we are trying to predict a student's performance using these features:
- Hours Studied
- Attendance
- Previous Grades
- Favorite Color
We train a model using all 4 features. During backward elimination, we find that “Favorite Color”
does not affect prediction. So we eliminate it.
Then we check again and remove the next least useful feature if performance stays the same or
improves.
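A minimal sketch of this idea with scikit-learn's SequentialFeatureSelector (direction="backward", available in scikit-learn 0.24+). The dataset here is synthetic and purely hypothetical: the target depends on the first three features, while "favorite_color" is noise.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.default_rng(0)
n = 100

# Hypothetical student data: the score depends on the first three features only.
X = pd.DataFrame({
    "hours_studied":   rng.uniform(0, 10, n),
    "attendance":      rng.uniform(50, 100, n),
    "previous_grades": rng.uniform(40, 100, n),
    "favorite_color":  rng.integers(0, 5, n),     # pure noise feature
})
y = (5 * X["hours_studied"] + 0.5 * X["attendance"]
     + 0.3 * X["previous_grades"] + rng.normal(0, 2, n))

# Backward elimination: start from all features and drop the least useful
# one at each step until 3 features remain.
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=3,
                                     direction="backward", cv=5)
selector.fit(X, y)

print(list(X.columns[selector.get_support()]))  # favorite_color is typically dropped
```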
When to Stop?
- When removing more features significantly reduces model accuracy.
- Or when a fixed number of top features is selected (e.g. top 5).
Limitations
- Computationally expensive with many features.
- Doesn't work well with correlated features (multicollinearity).
- May remove a feature that is weak alone but strong in combination with others.
Tips
- Best used with linear models like Linear Regression or Logistic Regression.
- Combine with other techniques like Variance Filter or PCA for better results.
When to Use
- You have too many features and need to simplify the model.
- You want to keep only the most impactful variables.
- You want a more interpretable model with fewer predictors.
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) might sound technical, but it’s really just a clever way of making
large datasets easier to understand. When you have too many features or variables, PCA helps you
simplify without losing the big picture. Think of it as a method to uncover the "essence" of your
data.
Why Do We Even Need PCA?
The Overfitting Problem
Let’s say you’re training a machine learning model and you decide to feed it every possible feature.
More features = more power, right? Not quite. Too many features can lead to overfitting—where the
model becomes excellent at memorizing training data but terrible at generalizing to new data.
- Models trained with too many features pick up noise instead of real patterns.
- This leads to great performance on training data but poor results on unseen data.
Visualizing Overfitting
Imagine a scatter plot. A well-fitted model draws a line that follows the trend. An overfitted model,
however, draws a squiggly line that tries to hit every single point, including outliers. It looks
fancy but performs poorly.
- Well-fitted model: Captures trends, ignores noise.
- Overfitted model: Matches every point, including irrelevant ones.
This is where PCA becomes a game-changer.
What PCA Brings to the Table
The Idea Behind PCA
PCA suggests a smarter way: “Let’s not use all features. Instead, let’s create new ones that
summarize the important parts of the data.”
- Reduces the number of features (called dimensions).
- Captures the essential patterns.
- Makes models faster, simpler, and often more accurate.
How PCA Works—The Concept
Think of your dataset as a 3D object. PCA looks at it from the best angles and creates simpler
versions (like 2D shadows) that still carry most of the information. Each new "view" is called a
principal component.
How PCA Actually Works (Step by Step)
Step 1: Finding the First Principal Component (PC1)
- PCA looks for the direction where the data varies the most.
- It projects all data onto a line in that direction.
- This line becomes PC1—the first and most important new feature.
Step 2: Adding the Second Component (PC2)
- PC2 is found by rotating 90° from PC1—it’s completely independent.
- It captures the second-most variation in the data.
Step 3: More Components (PC3, PC4, ...)
- PCA keeps going until it has as many components as the original features.
- Each one captures a little less variation than the one before.
- All components are orthogonal (i.e., independent) of each other.
Behind the Scenes: What PCA Really Does
- Takes messy, correlated features.
- Transforms them into neat, uncorrelated ones.
- You can now focus on just the most useful features and ignore the rest.
Key Principles of PCA
1. You Can’t Create More Than You Start With
- If you have 5 original features, you can get a maximum of 5 principal components.
- But most of the time, just a few components will carry most of the value.
Example: From 100 original features, PCA might find that 10 components explain 95% of the variance. You can safely use just those 10 (see the sketch after these principles).
2. Each Component Has Its Rank
- PC1 = most important (explains the most variation).
- PC2 = second-most important, and so on.
Pro Tip: In many real-world problems, the first two or three components often hold
most of the insight.
3. Components Are Independent (Orthogonal)
- No component can be predicted from another.
- Each adds unique, non-redundant information.
This independence is what makes PCA so effective—it cuts out repetition and makes your models more
efficient.
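A small sketch of this ranking in scikit-learn: passing a fraction to n_components asks PCA to keep just enough components to explain that share of the variance, which illustrates the "10 components explain 95%" example from Principle 1. The data below is random and purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                                       # 200 samples, 20 features
X[:, 10:] = 2 * X[:, :10] + rng.normal(scale=0.1, size=(200, 10))    # make 10 features redundant

pca = PCA(n_components=0.95)   # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X)

print(pca.n_components_)               # number of components actually kept (about 10 here)
print(pca.explained_variance_ratio_)   # PC1 explains the most, PC2 the next most, ...
```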
The PCA Workflow: Before, During, and After
Before PCA
- Your dataset is huge with too many features.
- Overfitting is common and performance drops on new data.
- The data is hard to visualize and interpret.
During PCA
- Analysis: Looks for patterns and variation.
- Transformation: Creates new features (principal components).
- Ranking: Orders them by how much variation they explain.
- Selection: You decide how many components to keep.
After PCA
- Cleaner data: Fewer, more meaningful features.
- Better model performance: Generalizes well.
- Faster computation: Less data = quicker results.
- Improved visualization: Easier to plot and explore.
- Noise reduction: Irrelevant features get filtered out.
Where PCA Is Used in the Real World
1. Image Compression & Recognition
- High-resolution images can have millions of features (pixels).
- PCA reduces this to a few hundred components.
- The image still looks the same to the human eye but is easier to process.
2. Customer Behavior Analysis
- Businesses collect tons of customer data.
- PCA helps uncover the key drivers of customer decisions.
- This insight helps improve marketing and product strategy.
3. Gene Expression Analysis
- Biologists study thousands of genes per patient sample.
- PCA finds which gene patterns matter most.
- It speeds up research and highlights meaningful biological clues.
Wrapping Up: Why Learn PCA?
PCA isn’t just a math trick—it’s a practical tool that solves real-world problems in data science.
Whether you’re cleaning up your dataset, improving model performance, or just trying to make sense
of complicated data, PCA is a great skill to have in your toolkit.
Projection Methods (Dimension Reduction)
Projection methods are powerful techniques used in machine learning and data analysis to reduce the
number of features (dimensions) in a dataset while retaining the most important information. This
helps in visualizing data, reducing storage/computation, and improving model performance.
In simpler terms, it's like taking a 3D object and showing it as a 2D shadow on a wall — you lose
some details, but keep the overall shape.
Why Dimension Reduction?
- To remove redundant or less useful features.
- To simplify models and reduce overfitting.
- To visualize high-dimensional data (e.g., plot 10D data in 2D).
- To improve training time and model performance.
Key Concepts
- High-Dimensional Data: Data with many features (e.g., 100 columns).
- Projection: Mapping high-dimensional data to a lower-dimensional space.
- Principal Components: New features (axes) that capture the most variance in
data.
- Variance: A measure of how much the data spreads out — more variance = more
information.
Popular Projection Techniques
- Principal Component Analysis (PCA): Projects data onto new axes that capture
maximum variance.
- Linear Discriminant Analysis (LDA): Similar to PCA, but considers class labels
to maximize class separability.
- Random Projection: Uses random matrices to reduce dimensions with minimal loss
of information.
How PCA Works (Step-by-Step)
- Step 1: Standardize the dataset (mean = 0, variance = 1).
- Step 2: Calculate the covariance matrix.
- Step 3: Find eigenvalues and eigenvectors of the covariance matrix.
- Step 4: Choose top-k eigenvectors (those with highest eigenvalues).
- Step 5: Project the data onto these top-k components.
Example Dataset
Let's consider this simple dataset of students based on their Math and English scores:
| Student | Math | English |
|---------|------|---------|
| Alice   | 90   | 85      |
| Bob     | 70   | 65      |
| Charlie | 60   | 60      |
| Diana   | 95   | 90      |
| Edward  | 80   | 70      |
| Fiona   | 85   | 80      |
Visual Intuition (2D to 1D Projection)
Imagine this dataset as points on a 2D graph (Math vs English). PCA will find a new line (principal
component) that best fits this spread of points.
We can then project each point onto this line and represent each student using just 1 value
(coordinate on this line), instead of 2 (Math and English).
Mathematical Insight
The direction of the new axis (principal component) is chosen such that:
- It passes through the mean of the data.
- It captures the maximum variance in the data.
- It is orthogonal (perpendicular) to other components.
Covariance Matrix Formula
Covariance measures how two features vary together. For two features X and Y:
\( \text{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) \)
Eigenvalues & Eigenvectors
These help us understand the directions (eigenvectors) and the importance (eigenvalues) of those
directions.
- Higher eigenvalue → more variance along that direction.
- Choose top eigenvectors to keep the most important directions.
Final Projection
Multiply the original data matrix with the top-k eigenvectors (components) to get the transformed
lower-dimensional data.
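A minimal NumPy sketch of these steps on the Math/English data above; the sign and exact scale of the scores depend on standardization and eigenvector sign conventions, so treat the numbers in the example output below as illustrative.

```python
import numpy as np

# Math and English scores for the six students above.
X = np.array([[90, 85], [70, 65], [60, 60], [95, 90], [80, 70], [85, 80]], dtype=float)

# Step 1: standardize each feature (mean 0, variance 1).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data.
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvalues and eigenvectors of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: keep the eigenvector with the largest eigenvalue (PC1).
pc1 = eigvecs[:, np.argmax(eigvals)]

# Step 5: project the standardized data onto PC1 -> one score per student.
print(X_std @ pc1)
```

Using sklearn.decomposition.PCA with n_components=1 on the standardized data gives the same projection, up to sign.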
Example Output (After PCA with 1 Component)
Each student now has a single score (Principal Component 1) that summarizes Math and English; the values below are illustrative:
| Student | PC1 Score |
|---------|-----------|
| Alice   | 1.5       |
| Bob     | -1.2      |
| Charlie | -2.0      |
| Diana   | 2.0       |
| Edward  | 0.3       |
| Fiona   | 1.0       |
Benefits of PCA
- Reduces noise in the dataset.
- Helps in visualization and pattern discovery.
- Makes machine learning models faster and more efficient.
Limitations
- PCA is a linear technique — it might not capture complex (non-linear) patterns.
- It may make interpretation of features difficult after transformation.
Real-Life Analogy
Imagine you have a high-resolution photo (lots of pixels = high dimension). You want to print a small
version for your wallet — you shrink it (dimension reduction) while keeping the main idea (faces,
colors) intact.