PCA computes a latent representation of a data matrix $\mathbf{X}$ via a projection onto an orthogonal matrix $V$:
$$\mathbf{X} \mapsto \mathbf{Z}:=\mathbf{X}V.$$
The columns of $V$ are also known as the **principal components (PC)**, and $V$ itself as the **loading matrix**. $\mathbf{Z}$ is the projection of $\mathbf{X}$ by $V$: $z_{ij}=\text{projection of }\mathbf{X}_{i,:}\text{ onto }v_{j}.$

In particular, the $V$ used by PCA is the matrix $V$ in the full [[Singular Value Decomposition|SVD]] of $\mathbf{X}=UDV^{T}$ (alternatively, it is the eigenvector matrix in the spectral decomposition of $\mathbf{X}^{T}\mathbf{X}$); therefore,
$$\mathbf{Z}=UDV^{T}V=UD.$$
If we perform **dimension reduction** via a **truncated PCA**, keeping only the information encoded by the first $k$ PCs, the compressed data is
$$\hat{\mathbf{X}}_{k}=\mathbf{Z}_{:,1:k}V_{:,1:k}^{T}=\mathbf{X}V_{:,1:k}V_{:,1:k}^{T}.$$
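
As a concrete illustration of the definitions above, here is a minimal numpy sketch (the synthetic data and the names `X`, `k`, etc. are illustrative, not from this note) that recovers $V$ from the SVD, forms the scores $\mathbf{Z}=\mathbf{X}V=UD$, and computes the rank-$k$ compression $\hat{\mathbf{X}}_{k}$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)                  # PCA assumes centered columns

# Thin SVD of X (enough to recover V and the scores): X = U diag(d) V^T
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T                             # columns of V = principal components / loadings

Z = X @ V                            # scores Z = XV
assert np.allclose(Z, U * d)         # equivalently Z = UD

# Truncated PCA: keep only the first k principal components
k = 2
X_hat_k = X @ V[:, :k] @ V[:, :k].T  # rank-k reconstruction of X
```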
## Derivation of the Principal Components

### PCA as Variance-Maximizing Projection

Assuming $\mathbf{X}$ has its columns centered at $0$, PCA can be derived by *finding the most "interesting" way of orthogonally projecting the data*, in the sense that it finds unit vectors $\nu_{1},\dots$ such that:

- Each $\nu_{i}$ maximizes $\| \mathbf{X}\nu_{i} \|^{2}$ (proportional to the sample variance of the projection, since $\mathbf{X}$ is centered), subject to being orthogonal to all $\nu_{j}$, $j<i$, that come before it.
- Here those vectors are written $\nu$ (the Greek letter), but in fact the solution is just $v$, the columns of $V$ (up to signs).

In the example below, the data is 2D: the population size and advertisement spending of $100$ cities.

![[PCA2D.png#invert|center|w80]]

Therefore, the principal components are directions along which $\mathbf{X}$ varies the most. This has a few caveats:

- It treats each column of $\mathbf{X}$ "equally", so its results are not preserved if one column is scaled up/down (e.g. via a unit conversion). A common practice is to *normalize each column to have a standard deviation of $1$*, which is equivalent to replacing $S=\mathrm{Cov}(\mathbf{X})=\mathbf{X}^{T}\mathbf{X}/(n-1)$ with $R=\mathrm{Corr}(\mathbf{X})$.
- If the projection is used as preprocessing for supervised tasks like regression/classification, *there is no guarantee that large variance in a predictor $\mathbf{X}_{:, j}$ is relevant for the response*.

> [!theorem|*] PCA as Variance-Maximizing Orthogonal Map
> The maximum-variance orthogonal projection
> $$\nu_{i}=\underset{\nu:\| \nu \|=1}{\arg\max}~ \nu^{T}S\nu , \quad \text{subject to }\forall j <i:\nu \perp \nu_{j},$$
> is solved by the columns of $V$, i.e. $\text{for }i=1,\dots,p:~\nu_{i}=v_{i}.$

> [!exposition]- Derivation/proof of the result
> This involves the sample covariance matrix $S=\mathbf{X}^{T}\mathbf{X} / (n-1)$, since an orthogonal projection $\mathbf{X} \nu$ has sample variance $\nu^{T}S\nu$.
>
> Since $S$ is symmetric and positive semi-definite, the optimization can be carried out in the basis of its eigenvectors: let $S= VD^{T}DV^{T}/(n-1)$ be its eigen-decomposition, where $V,D$ are the same matrices as in the full SVD of $\mathbf{X}$. Dropping the constant $1/(n-1)$ (which does not affect the maximizers), the (unconstrained) objective becomes
> $$\underset{\nu:\| \nu \| =1}{\arg\max}~ \nu^{T} V D^{T}DV^{T}\nu = \underset{\nu:\| \nu \| =1}{\arg\max}~\| DV^{T}\nu \|^{2} = \cdots$$
>
> However, since *$\nu \mapsto V^{T}\nu=: \tilde{\nu}$ preserves both orthogonality and unit length*, we can optimize over $\tilde{\nu}:= V^{T}\nu$ instead, and undoing the transformation (by left-applying $V$) gives maximizers of the original objective:
> $$\cdots=V\left(\underset{\tilde{\nu}:\| \tilde{\nu} \|=1 }{\arg\max}~\left\| D\tilde{\nu} \right\|^{2}\right).$$
>
> *First principal component*: since $D$ is diagonal, the first maximizer is just $\tilde{\nu}_{1}=(1,0,\dots,0)$ (here we assume the singular values in $D$ are in descending order). Transforming back by left-applying $V$ gives $\nu_{1}=v_{1}$, the first column of $V$.
>
> *Successive principal components*: imposing orthogonality of $\tilde{\nu}_{2},\dots$ to the previous $\tilde{\nu}$'s forces them to be $\tilde{\nu}_{i}=(0,\dots,0,\underset{i\text{th}}{1},0,\dots,0)$. Transforming back gives $\nu_{i}=V\tilde{\nu}_{i}=v_{i}$.

Following the exposition above, the eigenvalues $\lambda_{1},\dots,\lambda_{p}$ of $S$ (i.e. the squared singular values of $\mathbf{X}$, scaled by $1/(n-1)$) are the **variance explained** by each of the PCs $v_{1},\dots,v_{p}$. They satisfy
$$\sum_{j}\lambda_{j}=\mathrm{tr}(S)=\frac{ \mathrm{tr}(\mathbf{X}^{T}\mathbf{X})}{n-1}= \frac{1}{n-1} \sum_{j}\| \mathbf{X}_{:,j} \|^{2}=p,$$
where the last equality assumes that each column is standardized to have sample variance $1$, so that $\| \mathbf{X}_{:,j} \|^{2}=n-1$.

### PCA as $l_{2}$ Risk Minimizer

Going the other way around, instead of maximizing the variance of their projections, the PCs equivalently minimize the variance left behind:

> [!theorem|*] PCA as Risk-Minimizing Encoder
> In the space of rank-$K$ ($1\le K \le p$) orthogonal encoders $A \in \mathbb{R}^{p \times K}$ (with $A^{T}A=I_{K}$), where a vector $\mathbf{x} \in \mathbb{R}^{p}$ is mapped to $\hat{\mathbf{x}}=A A^{T}\mathbf{x}$, the empirical $l_{2}$ risk
> $$\hat{R}_{A}=\sum_{i=1}^{n}\| \mathbf{X}_{i,:}-A A^{T}\mathbf{X}_{i,:} \|^{2}$$
> is minimized by the first $K$ columns of $V$:
> $$V_{:, 1:K}=\underset{A}{\arg\min}~\hat{R}_{A}.$$
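
To illustrate the risk-minimizing characterization numerically, here is a small numpy sketch (the synthetic data and the helper `risk` are my own, purely illustrative): among rank-$K$ orthonormal encoders, the top-$K$ loading vectors should attain the smallest empirical $l_{2}$ risk.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, K = 200, 6, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # columns with some correlation
X -= X.mean(axis=0)                                     # center, as before

_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

def risk(A):
    """Empirical l2 risk of the encoder x -> A A^T x, summed over the rows of X."""
    return np.sum((X - X @ A @ A.T) ** 2)

risk_pca = risk(V[:, :K])

# Compare against random rank-K orthonormal encoders (Q from a thin QR factorization)
for _ in range(1000):
    Q, _ = np.linalg.qr(rng.normal(size=(p, K)))
    assert risk_pca <= risk(Q) + 1e-9   # PCA attains the minimum empirical risk
```

The same check can be read through the variance-maximizing lens: since $\hat{R}_{A}=\| \mathbf{X} \|_{F}^{2}-\mathrm{tr}(A^{T}\mathbf{X}^{T}\mathbf{X}A)$, minimizing the left-over risk is equivalent to maximizing the variance captured by the encoder.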