
Principal Component Analysis, Explained Intuitively

An intuitive explanation of PCA for dimensionality reduction, covering variance maximization, covariance matrices, and the spectral theorem.

Principal Component Analysis (PCA) is a widely used technique for dimensionality reduction and feature extraction. Its goal is to find a new set of orthogonal axes (principal components) in the feature space onto which we can project our data, such that we can reduce the number of dimensions (axes) while preserving as much information as possible.

Analyzing the Goal

To achieve PCA, we first need to define what we mean by “preserving as much information as possible”. A common approach is to maximize the variance of the projected data along the new axes. The intuition is that directions with higher variance capture more of the data’s structure and variability, while directions with low variance may represent noise or less important features. Because variance is a quantitative measure of how spread out the data points are, we now have a tool for measuring the “quality” of a new axis: the higher the variance of the data projected onto that axis, the better that axis is at capturing the information in the data.

Additionally, we want the new axes (principal components) to be uncorrelated with each other, meaning that they exhibit no linear dependence. Intuitively, correlation indicates overlapping linear information: changes along one direction partially explain changes along another. By choosing uncorrelated axes, each principal component captures a unique linear aspect of the data’s variability, allowing us to reduce dimensionality without unnecessary redundancy.

Thus, our goal in PCA can be summarized as follows:

  1. Find a set of new axes (principal components) such that the variance of the data projected onto these axes is maximized.
  2. Ensure that these new axes are uncorrelated with each other.
  3. Order the principal components by the amount of variance they capture, so that we can select the top $k$ components for dimensionality reduction.

Now that we have a high-level understanding of PCA’s objectives, we can dive into the mathematics behind it.

Notation and Setup

Let $\mathbf{x} \in \mathbb{R}^d$ be a random vector representing a data point with $d$ features. We can think of each realization of $\mathbf{x}$ as a point in a $d$-dimensional space. The collection of all such points forms a “cloud” whose shape reflects the underlying distribution of the data.

Given $n$ samples, we arrange them as columns in a data matrix $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$.

We represent each data point as a column vector to be consistent with linear algebra conventions. This differs from the common ML convention of using row vectors.

Why Centering Matters

Before applying PCA, we center the data by subtracting the sample mean:

$$\bar{\mathbf{x}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i, \quad \tilde{\mathbf{x}}_i = \mathbf{x}_i - \bar{\mathbf{x}}$$

This gives us centered data $\tilde{\mathbf{X}} = [\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_n]$ with $E[\tilde{\mathbf{x}}] = \mathbf{0}$.

Why is centering essential? In PCA, we search for directions (unit vectors) that maximize variance. A direction is a line through the origin, starting from $\mathbf{0}$. When we project data onto a direction $\mathbf{w}$, we measure how spread out the projections are along that line through the origin.

If the data’s centroid (mean) is not at the origin, this measurement becomes meaningless. Imagine your data cloud is centered at some point $\mathbf{c}$ far from the origin. A direction $\mathbf{w}$ that passes through the origin might completely miss the data cloud, or intersect it at an arbitrary angle unrelated to the cloud’s actual shape. The “variance” you compute would reflect the distance from the origin to the data, not the spread within the data.

By centering, we move the data’s centroid to coincide with the origin. Now, directions through the origin pass through the “center” of the data cloud, and variance along a direction genuinely measures the spread of data in that direction.

From now on, we assume all data is centered and drop the tilde for simplicity.
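To make the centering step concrete, here is a minimal NumPy sketch; the data matrix and its values are made up purely for illustration:

```python
import numpy as np

# Toy data: d = 3 features, n = 5 samples, stored as columns
# (matching the column-vector convention used in this article).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(3, 5))

# Sample mean of each feature, kept as a column so broadcasting works.
x_bar = X.mean(axis=1, keepdims=True)

# Centered data: subtract the mean from every column.
X_tilde = X - x_bar

# Every feature of the centered data now has (numerically) zero mean.
print(X_tilde.mean(axis=1))
```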

Geometric Intuition

Imagine the data cloud in $\mathbb{R}^d$. It might be elongated in some directions and compressed in others, like an ellipsoid rather than a sphere. PCA aims to find the axes of this ellipsoid: the directions along which the data is most spread out.

The first principal component is the direction along which projecting the data yields the largest variance. The second principal component is the direction (orthogonal to the first) that captures the next largest variance, and so on.

The Covariance Matrix

To formalize this, we first need to introduce the covariance matrix $\Sigma$ of the random vector $\mathbf{x}$:

$$\text{Cov}(\mathbf{x}) = E[(\mathbf{x} - E[\mathbf{x}])(\mathbf{x} - E[\mathbf{x}])^T]$$

Since the data is centered, $E[\mathbf{x}] = \mathbf{0}$, so:

$$\text{Cov}(\mathbf{x}) = E[(\mathbf{x} - \mathbf{0})(\mathbf{x} - \mathbf{0})^T] = E[\mathbf{x}\mathbf{x}^T] = \Sigma$$

The diagonal entries $\Sigma_{ii} = E[x_i^2]$ are the variances of individual features, and the off-diagonal entries $\Sigma_{ij} = E[x_i x_j]$ are the covariances between features.

It might not be immediately obvious why the covariance matrix is useful, but it sets the stage for finding the optimal directions (principal components) by analyzing the properties of $\Sigma$.

Empirical covariance matrix. In practice, we estimate $\Sigma$ from the data:

$$\Sigma \approx \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^T = \frac{1}{n} \mathbf{X} \mathbf{X}^T$$
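The empirical covariance is one line of NumPy. As a sanity check, the sketch below compares it against `np.cov` with `bias=True`, which selects the $1/n$ normalizer used here rather than NumPy's default unbiased $1/(n-1)$ estimator (the data is synthetic, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 1000
X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)   # center first

# Empirical covariance: (1/n) X X^T, a d x d symmetric matrix.
Sigma = (X @ X.T) / n

# np.cov treats rows as variables; bias=True uses the 1/n normalizer.
assert np.allclose(Sigma, np.cov(X, bias=True))
```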

Properties of the Covariance Matrix

Before finding principal components, we need to understand key properties of $\Sigma$:

$\Sigma$ is symmetric. Since $\Sigma_{ij} = E[x_i x_j] = E[x_j x_i] = \Sigma_{ji}$, the matrix is symmetric: $\Sigma = \Sigma^T$.

The Spectral Theorem. Any real symmetric matrix can be diagonalized by an orthogonal matrix. Specifically, there exists an orthogonal matrix $\mathbf{W}$ (where $\mathbf{W}^T \mathbf{W} = \mathbf{W}\mathbf{W}^T = \mathbf{I}$) such that:

$$\mathbf{W}^T \Sigma \mathbf{W} = \Lambda$$

where $\Lambda$ is a diagonal matrix of eigenvalues and the columns of $\mathbf{W}$ are the corresponding eigenvectors. Again, it might not be clear why the fact that $\Sigma$ can be diagonalized is important, but it will become clear as we proceed.

Finding All Principal Components at Once

Rather than finding principal components one by one, we can leverage the spectral theorem to find them all simultaneously. Let $\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_d]$ be the orthogonal matrix of eigenvectors of $\Sigma$, ordered by decreasing eigenvalue:

$$\Lambda = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_d \end{bmatrix}, \quad \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_d \geq 0$$
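In code, this decomposition is a single call to a symmetric eigensolver. A NumPy sketch on synthetic data (note that `np.linalg.eigh` returns eigenvalues in ascending order, so we flip them to match the convention above):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 4, 500
X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)
Sigma = (X @ X.T) / n

# eigh is the right routine for symmetric matrices: real eigenvalues,
# orthonormal eigenvectors. Reorder so that lam[0] >= ... >= lam[d-1].
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]           # sorted eigenvalues
W = eigvecs[:, order]          # columns are w_1, ..., w_d

# W is orthogonal and diagonalizes Sigma: W^T Sigma W = Lambda.
assert np.allclose(W.T @ W, np.eye(d))
assert np.allclose(W.T @ Sigma @ W, np.diag(lam))
```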

The Transformation and Its Covariance

The transformation $\mathbf{z} = \mathbf{W}^T \mathbf{x}$ projects the original data onto the new axes defined by the eigenvectors. Each component $z_i = \mathbf{w}_i^T \mathbf{x}$ is the projection onto the $i$-th axis.

What is the covariance matrix of the transformed data $\mathbf{z}$? Since $\mathbf{z} = \mathbf{W}^T \mathbf{x}$ and $E[\mathbf{z}] = \mathbf{W}^T E[\mathbf{x}] = \mathbf{0}$:

$$\text{Cov}(\mathbf{z}) = E[\mathbf{z}\mathbf{z}^T] = E[(\mathbf{W}^T \mathbf{x})(\mathbf{W}^T \mathbf{x})^T] = E[\mathbf{W}^T \mathbf{x} \mathbf{x}^T \mathbf{W}] = \mathbf{W}^T E[\mathbf{x}\mathbf{x}^T] \mathbf{W} = \mathbf{W}^T \Sigma \mathbf{W}$$

By the spectral theorem, $\mathbf{W}^T \Sigma \mathbf{W} = \Lambda$. Therefore:

$$\text{Cov}(\mathbf{z}) = \Lambda$$

This shows that with the transformation defined by the eigenvector matrix $\mathbf{W}$, the covariance matrix of the transformed data $\mathbf{z}$ is diagonal. Attentive readers will notice that a diagonal covariance matrix has two important implications:

  • The diagonal entries $\Lambda_{ii} = \lambda_i$ are the variances: $\text{Var}(z_i) = \lambda_i$
  • The off-diagonal entries are all zero: $\text{Cov}(z_i, z_j) = 0$ for $i \neq j$

This tells us two crucial things:

  1. The transformed components are uncorrelated—exactly what we wanted!
  2. The variance of each component equals the corresponding eigenvalue.
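The derivation above can also be checked numerically: build deliberately correlated data, transform it with the eigenvectors of its covariance matrix, and confirm that the resulting covariance is diagonal. A small NumPy sketch (synthetic data, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 3, 2000
# Mix independent sources with a random matrix so the features covary.
A = rng.normal(size=(d, d))
X = A @ rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)

Sigma = (X @ X.T) / n
eigvals, W = np.linalg.eigh(Sigma)

# Transform every sample: z = W^T x.
Z = W.T @ X

# Empirical covariance of z: diagonal, with the eigenvalues of Sigma
# (here in eigh's ascending order) on the diagonal.
Cov_z = (Z @ Z.T) / n
assert np.allclose(Cov_z, np.diag(eigvals))
```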

Identifying Principal Components

We just showed that when we use the eigenvector matrix $\mathbf{W}$ to define new axes, the variance along the $i$-th axis equals $\lambda_i$. Since $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_d$, the direction with maximum variance is $\mathbf{w}_1$, the eigenvector corresponding to the largest eigenvalue $\lambda_1$.

Finding the first principal component is simply extracting the first column of the transformation matrix $\mathbf{W}$:

$$\mathbf{w}_1 = \text{first column of } \mathbf{W}, \quad \text{Var}(\mathbf{w}_1^T \mathbf{x}) = \lambda_1$$

Similarly, the second principal component $\mathbf{w}_2$ (the direction with the second highest variance, orthogonal to $\mathbf{w}_1$) is the second column of $\mathbf{W}$, and so on.

The eigenvector matrix $\mathbf{W}$ simultaneously gives us:

  • All principal component directions (as columns)
  • The assurance that these directions are uncorrelated (off-diagonal covariance is zero)
  • The variance captured by each (the corresponding eigenvalues)

It might seem surprising that the eigenvectors of the covariance matrix yield exactly what we want for PCA. This can be understood from an optimization perspective: finding the first principal component is equivalent to solving $\max_{\|\mathbf{w}\|=1} \mathrm{Var}(\mathbf{w}^T \mathbf{x})$. One can show that the solution to this problem is the eigenvector of $\Sigma$ corresponding to its largest eigenvalue, thereby providing a direct connection to the spectral theorem. The details of this derivation are beyond the scope of this section.
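For the curious, the key step can be sketched in a few lines. For centered data, $\mathrm{Var}(\mathbf{w}^T \mathbf{x}) = \mathbf{w}^T \Sigma \mathbf{w}$, and the unit-norm constraint is handled with a Lagrange multiplier:

```latex
\mathcal{L}(\mathbf{w}, \lambda)
  = \mathbf{w}^T \Sigma \mathbf{w} - \lambda \,(\mathbf{w}^T \mathbf{w} - 1),
\qquad
\nabla_{\mathbf{w}} \mathcal{L}
  = 2\,\Sigma \mathbf{w} - 2\lambda\,\mathbf{w} = \mathbf{0}
\;\Longrightarrow\;
\Sigma \mathbf{w} = \lambda \mathbf{w}
```

So every stationary point is an eigenvector of $\Sigma$, and at such a point the objective equals $\mathbf{w}^T \Sigma \mathbf{w} = \lambda \mathbf{w}^T \mathbf{w} = \lambda$, which is largest for the top eigenvalue.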

Dimensionality Reduction

To reduce from $d$ dimensions to $k$ dimensions (where $k < d$), we select the top $k$ eigenvectors, forming $\mathbf{W}_k = [\mathbf{w}_1, \ldots, \mathbf{w}_k] \in \mathbb{R}^{d \times k}$:

$$\mathbf{z}_{\text{reduced}} = \mathbf{W}_k^T \mathbf{x} \in \mathbb{R}^k$$

This retains the $k$ directions with the highest variance, discarding the rest. Our work to implement PCA is now complete! We have successfully derived a method to find principal components that captures maximum information while ensuring uncorrelatedness.
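A short NumPy sketch of the reduction step (random synthetic data; the value of `k` is chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, k = 5, 1000, 2
A = rng.normal(size=(d, d))
X = A @ rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)

Sigma = (X @ X.T) / n
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]

# Keep only the top-k eigenvectors: W_k is d x k.
W_k = eigvecs[:, order[:k]]

# Project all n samples down to k dimensions at once.
Z_reduced = W_k.T @ X

# Fraction of the total variance retained by the top k components.
explained = eigvals[order[:k]].sum() / eigvals.sum()
print(Z_reduced.shape, round(explained, 3))
```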

Summary

PCA finds orthogonal directions (principal components) that maximize variance:

  1. Center the data by subtracting the mean (so the centroid is at the origin).
  2. Compute the covariance matrix $\Sigma = \frac{1}{n} \mathbf{X} \mathbf{X}^T$.
  3. Diagonalize $\Sigma$ to obtain eigenvalues $\lambda_i$ and eigenvectors $\mathbf{w}_i$.
  4. Sort eigenvectors by decreasing eigenvalue.
  5. Select the top $k$ eigenvectors to form $\mathbf{W}_k$.
  6. Transform the data: $\mathbf{z} = \mathbf{W}_k^T \mathbf{x}$.
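These steps translate almost line for line into NumPy. A minimal end-to-end sketch (the function name and interface here are my own, not a standard API):

```python
import numpy as np

def pca(X, k):
    """PCA on a (d, n) data matrix with samples as columns.

    Returns the (k, n) projected data, the (d, k) component matrix,
    and all d eigenvalues sorted in decreasing order.
    """
    X = X - X.mean(axis=1, keepdims=True)        # 1. center
    Sigma = (X @ X.T) / X.shape[1]               # 2. covariance (1/n)
    eigvals, eigvecs = np.linalg.eigh(Sigma)     # 3. diagonalize
    order = np.argsort(eigvals)[::-1]            # 4. sort by eigenvalue
    W_k = eigvecs[:, order[:k]]                  # 5. top-k eigenvectors
    return W_k.T @ X, W_k, eigvals[order]        # 6. transform

# Sanity check on correlated synthetic data: the projected components
# are uncorrelated, with variances equal to the top eigenvalues.
rng = np.random.default_rng(5)
X = rng.normal(size=(4, 4)) @ rng.normal(size=(4, 300))
Z, W_k, lam = pca(X, k=2)
C = (Z @ Z.T) / Z.shape[1]
assert np.allclose(C, np.diag(lam[:2]))
```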