Principal Component Analysis, Explained Intuitively
An intuitive explanation of PCA for dimensionality reduction, covering variance maximization, covariance matrices, and the spectral theorem.
Principal Component Analysis (PCA) is a widely used technique for dimensionality reduction and feature extraction. Its goal is to find a new set of orthogonal axes (principal components) in the feature space onto which we can project our data, such that we can reduce the number of dimensions (axes) while preserving as much information as possible.
Analyzing the Goal
To derive PCA, we first need to define what we mean by “preserving as much information as possible”. A common approach is to maximize the variance of the projected data along the new axes. The intuition is that directions with higher variance capture more of the data’s structure and variability, while directions with low variance may represent noise or less important features. Because variance is a quantitative measure of how spread out the data points are, we now have a tool for measuring the “quality” of a new axis: the higher the variance of the data projected onto that axis, the better that axis is at capturing the information in the data.
Additionally, we want the new axes (principal components) to be uncorrelated with each other, meaning that they exhibit no linear dependence. Intuitively, correlation indicates overlapping linear information: changes along one direction partially explain changes along another. By choosing uncorrelated axes, each principal component captures a unique linear aspect of the data’s variability, allowing us to reduce dimensionality without unnecessary redundancy.
Thus, our goal in PCA can be summarized as follows:
- Find a set of new axes (principal components) such that the variance of the data projected onto these axes is maximized.
- Ensure that these new axes are uncorrelated with each other.
- Order the principal components by the amount of variance they capture, so that we can select the top components for dimensionality reduction.
Now that we have a high-level understanding of PCA’s objectives, we can dive into the mathematics behind it.
Notation and Setup
Let $\mathbf{x} \in \mathbb{R}^d$ be a random vector representing a data point with $d$ features. We can think of each realization of $\mathbf{x}$ as a point in a $d$-dimensional space. The collection of all such points forms a “cloud” whose shape reflects the underlying distribution of the data.
Given $n$ samples $\mathbf{x}_1, \dots, \mathbf{x}_n$, we arrange them as columns in a data matrix $X = [\mathbf{x}_1 \mid \cdots \mid \mathbf{x}_n] \in \mathbb{R}^{d \times n}$.
We represent each data point as a column vector to be consistent with linear algebra conventions. This differs from the common ML convention of using row vectors.
Why Centering Matters
Before applying PCA, we center the data by subtracting the sample mean:

$$\tilde{\mathbf{x}}_i = \mathbf{x}_i - \bar{\mathbf{x}}, \qquad \bar{\mathbf{x}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i$$

This gives us centered data with $\frac{1}{n} \sum_{i=1}^{n} \tilde{\mathbf{x}}_i = \mathbf{0}$.
Why is centering essential? In PCA, we search for directions (unit vectors $\mathbf{w}$ with $\|\mathbf{w}\| = 1$) that maximize variance. A direction is a line through the origin—it starts from $\mathbf{0}$. When we project data onto a direction $\mathbf{w}$, we measure how spread out the projections are along that line through the origin.
If the data’s centroid (mean) is not at the origin, this measurement becomes meaningless. Imagine your data cloud is centered at some point far from the origin. A direction that passes through the origin might completely miss the data cloud, or intersect it at an arbitrary angle unrelated to the cloud’s actual shape. The “variance” you compute would reflect the distance from the origin to the data, not the spread within the data.
By centering, we move the data’s centroid to coincide with the origin. Now, directions through the origin pass through the “center” of the data cloud, and variance along a direction genuinely measures the spread of data in that direction.
From now on, we assume all data is centered and drop the tilde for simplicity.
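As a quick illustration, centering takes only a few lines of NumPy. This is a minimal sketch with synthetic two-feature data; the shapes and values are illustrative assumptions, not from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: d = 2 features as rows, n = 500 samples as columns,
# deliberately placed far from the origin (illustrative values).
X = rng.normal(loc=[[5.0], [-3.0]], scale=[[2.0], [0.5]], size=(2, 500))

# Subtract the per-feature sample mean; the centroid moves to the origin.
mean = X.mean(axis=1, keepdims=True)   # shape (2, 1), broadcasts over columns
X_centered = X - mean

# Every feature of the centered data now has sample mean zero.
print(np.allclose(X_centered.mean(axis=1), 0.0))  # True
```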
Geometric Intuition
Imagine the data cloud in $\mathbb{R}^d$. It might be elongated in some directions and compressed in others—like an ellipsoid rather than a sphere. PCA aims to find the axes of this ellipsoid: the directions along which the data is most spread out.
The first principal component is the direction along which projecting the data yields the largest variance. The second principal component is the direction (orthogonal to the first) that captures the next largest variance, and so on.
The Covariance Matrix
To formalize this, we first need to introduce the covariance matrix of the random vector $\mathbf{x}$:

$$\Sigma = \mathbb{E}\big[(\mathbf{x} - \mathbb{E}[\mathbf{x}])(\mathbf{x} - \mathbb{E}[\mathbf{x}])^\top\big]$$

Since the data is centered, $\mathbb{E}[\mathbf{x}] = \mathbf{0}$, so:

$$\Sigma = \mathbb{E}[\mathbf{x} \mathbf{x}^\top]$$
The diagonal entries $\Sigma_{ii} = \mathrm{Var}(x_i)$ are the variances of individual features, and the off-diagonal entries $\Sigma_{ij} = \mathrm{Cov}(x_i, x_j)$ are the covariances between features.
It might not be immediately obvious why the covariance matrix is useful, but it sets the stage for finding the optimal directions (principal components) by analyzing the properties of $\Sigma$.
Empirical covariance matrix. In practice, we estimate $\Sigma$ from the centered data:

$$\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^\top = \frac{1}{n} X X^\top$$
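In code, the empirical covariance of centered, column-sample data is a single matrix product. A minimal NumPy sketch with toy data (illustrative shapes; note that `np.cov` defaults to the $1/(n-1)$ normalization instead of $1/n$):

```python
import numpy as np

rng = np.random.default_rng(1)

# Centered toy data: d = 3 features as rows, n = 1000 samples as columns.
X = rng.normal(size=(3, 1000))
X = X - X.mean(axis=1, keepdims=True)

n = X.shape[1]
Sigma = (X @ X.T) / n   # empirical covariance, shape (d, d)

print(Sigma.shape)                   # (3, 3)
print(np.allclose(Sigma, Sigma.T))   # True: covariance is symmetric
```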
Properties of the Covariance Matrix
Before finding principal components, we need to understand key properties of $\Sigma$:

$\Sigma$ is symmetric. Since $\mathrm{Cov}(x_i, x_j) = \mathrm{Cov}(x_j, x_i)$, the matrix is symmetric: $\Sigma = \Sigma^\top$.
The Spectral Theorem. Any real symmetric matrix can be diagonalized by an orthogonal matrix. Specifically, there exists an orthogonal matrix $W$ (where $W^\top W = W W^\top = I$) such that:

$$\Sigma = W \Lambda W^\top$$

where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$ is a diagonal matrix of eigenvalues and the columns of $W$ are the corresponding eigenvectors. Again, it might not be clear why the fact that $\Sigma$ can be diagonalized is important, but it will become clear as we proceed.
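The spectral theorem is easy to check numerically. `np.linalg.eigh` is the right tool here because it is specialized for symmetric matrices; the matrix below is an arbitrary symmetric example of my choosing:

```python
import numpy as np

# An arbitrary real symmetric matrix standing in for a covariance matrix.
Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])

# eigh handles symmetric (Hermitian) matrices: eigenvalues come back in
# ascending order, with orthonormal eigenvectors as the columns of W.
lams, W = np.linalg.eigh(Sigma)

print(np.allclose(W.T @ W, np.eye(3)))              # True: W is orthogonal
print(np.allclose(W @ np.diag(lams) @ W.T, Sigma))  # True: Sigma = W Λ W^T
```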
Finding All Principal Components at Once
Rather than finding principal components one by one, we can leverage the spectral theorem to find them all simultaneously. Let $W = [\mathbf{w}_1 \mid \mathbf{w}_2 \mid \cdots \mid \mathbf{w}_d]$ be the orthogonal matrix of eigenvectors of $\Sigma$, ordered by decreasing eigenvalue:

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$$
The Transformation and Its Covariance
The transformation $\mathbf{y} = W^\top \mathbf{x}$ projects the original data onto the new axes defined by the eigenvectors. Each component $y_i = \mathbf{w}_i^\top \mathbf{x}$ is the projection onto the $i$-th axis.
What is the covariance matrix of the transformed data $\mathbf{y}$? Since $\mathbf{y} = W^\top \mathbf{x}$ and $\mathbb{E}[\mathbf{y}] = W^\top \mathbb{E}[\mathbf{x}] = \mathbf{0}$:

$$\Sigma_{\mathbf{y}} = \mathbb{E}[\mathbf{y} \mathbf{y}^\top] = \mathbb{E}[W^\top \mathbf{x} \mathbf{x}^\top W] = W^\top \Sigma W$$

By the spectral theorem, $\Sigma = W \Lambda W^\top$. Therefore:

$$\Sigma_{\mathbf{y}} = W^\top (W \Lambda W^\top) W = (W^\top W) \Lambda (W^\top W) = \Lambda$$
This shows that with the transformation defined by the eigenvector matrix $W$, the covariance matrix of the transformed data is diagonal. Attentive readers will notice that a diagonal covariance matrix has two important implications:
- The diagonal entries are the variances: $\mathrm{Var}(y_i) = \lambda_i$
- The off-diagonal entries are all zero: $\mathrm{Cov}(y_i, y_j) = 0$ for $i \ne j$
This tells us two crucial things:
- The transformed components are uncorrelated—exactly what we wanted!
- The variance of each component equals the corresponding eigenvalue.
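Both implications are straightforward to verify numerically: transforming with the eigenvector matrix produces a diagonal covariance whose entries are the eigenvalues. A sketch with synthetic correlated data (the mixing matrix is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic correlated data: mix two independent signals (illustrative).
A = np.array([[2.0, 0.0],
              [1.5, 0.5]])
X = A @ rng.normal(size=(2, 5000))
X = X - X.mean(axis=1, keepdims=True)    # center

Sigma = (X @ X.T) / X.shape[1]
lams, W = np.linalg.eigh(Sigma)

Y = W.T @ X                              # transform all samples at once
Sigma_y = (Y @ Y.T) / Y.shape[1]

off_diag = Sigma_y - np.diag(np.diag(Sigma_y))
print(np.allclose(off_diag, 0.0, atol=1e-10))   # True: components uncorrelated
print(np.allclose(np.diag(Sigma_y), lams))      # True: variances = eigenvalues
```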
Identifying Principal Components
We just showed that when we use the eigenvector matrix $W$ to define new axes, the variance along the $i$-th axis equals $\lambda_i$. Since $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$, the direction with maximum variance is $\mathbf{w}_1$—the eigenvector corresponding to the largest eigenvalue $\lambda_1$.
Finding the first principal component is simply extracting the first column of the transformation matrix $W$:

$$\mathbf{w}_1 = W_{:, 1}$$
Similarly, the second principal component (the direction with the second highest variance, orthogonal to $\mathbf{w}_1$) is the second column of $W$, and so on.
The eigenvector matrix $W$ simultaneously gives us:
- All principal component directions (as columns)
- The assurance that these directions are uncorrelated (off-diagonal covariance is zero)
- The variance captured by each (the corresponding eigenvalues)
It might seem surprising that the eigenvectors of the covariance matrix yield exactly what we want for PCA. This can be understood from an optimization perspective: finding the first principal component is equivalent to solving

$$\max_{\|\mathbf{w}\| = 1} \mathbf{w}^\top \Sigma \mathbf{w}.$$

One can show that the solution to this problem is the eigenvector of $\Sigma$ corresponding to its largest eigenvalue, thereby providing a direct connection to the spectral theorem. The details of this derivation are beyond the scope of this section.
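The optimization claim can at least be sanity-checked empirically: no unit vector achieves a larger Rayleigh quotient $\mathbf{w}^\top \Sigma \mathbf{w}$ than the top eigenvector. A small sketch using a random symmetric positive semi-definite matrix as a stand-in covariance:

```python
import numpy as np

rng = np.random.default_rng(3)

# A random symmetric PSD matrix standing in for a covariance matrix.
M = rng.normal(size=(4, 4))
Sigma = M @ M.T

lams, W = np.linalg.eigh(Sigma)   # ascending eigenvalues
w1 = W[:, -1]                     # eigenvector of the largest eigenvalue

# At w1, the Rayleigh quotient equals the top eigenvalue...
print(np.isclose(w1 @ Sigma @ w1, lams[-1]))   # True

# ...and no random unit vector beats it.
best_random = 0.0
for _ in range(1000):
    w = rng.normal(size=4)
    w /= np.linalg.norm(w)
    best_random = max(best_random, w @ Sigma @ w)
print(best_random <= lams[-1] + 1e-9)          # True
```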
Dimensionality Reduction
To reduce from $d$ dimensions to $k$ dimensions (where $k < d$), we select the top $k$ eigenvectors, forming $W_k = [\mathbf{w}_1 \mid \cdots \mid \mathbf{w}_k] \in \mathbb{R}^{d \times k}$:

$$\mathbf{y} = W_k^\top \mathbf{x} \in \mathbb{R}^k$$

This retains the $k$ directions with the highest variance and discards the rest. Our derivation of PCA is now complete: we have a method for finding principal components that capture maximum variance while remaining mutually uncorrelated.
Summary
PCA finds orthogonal directions (principal components) that maximize variance:
- Center the data by subtracting the mean (so the centroid is at the origin).
- Compute the covariance matrix $\hat{\Sigma} = \frac{1}{n} X X^\top$.
- Diagonalize $\hat{\Sigma}$ to obtain eigenvalues $\lambda_i$ and eigenvectors $\mathbf{w}_i$.
- Sort eigenvectors by decreasing eigenvalue.
- Select the top $k$ eigenvectors to form $W_k$.
- Transform the data: $Y = W_k^\top X$.
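The whole summary condenses into a short function. This is a minimal sketch under the conventions used above (column samples, $1/n$ covariance); the function name and the planar toy data are my own, not from the article:

```python
import numpy as np

def pca(X, k):
    """Project column-sample data X (d x n) onto its top-k principal axes.

    Returns (Y, W_k, lams): the (k x n) reduced data, the (d x k) matrix
    of principal directions, and all eigenvalues in decreasing order.
    """
    X = X - X.mean(axis=1, keepdims=True)   # 1. center
    Sigma = (X @ X.T) / X.shape[1]          # 2. covariance matrix
    lams, W = np.linalg.eigh(Sigma)         # 3. diagonalize (ascending order)
    order = np.argsort(lams)[::-1]          # 4. sort by decreasing eigenvalue
    lams, W = lams[order], W[:, order]
    W_k = W[:, :k]                          # 5. keep top-k eigenvectors
    return W_k.T @ X, W_k, lams             # 6. transform

# Usage: 3-D toy data that is essentially planar, reduced to 2-D.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(scale=3.0, size=(1, 400)),
               rng.normal(scale=1.0, size=(1, 400)),
               rng.normal(scale=0.05, size=(1, 400))])
Y, W_k, lams = pca(X, k=2)
print(Y.shape)   # (2, 400)
```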